Building Customer Lifetime Models
In this workbook we investigate different ways to model the lifetime of individual customers, with a view to later combining this lifetime approach with our frequency models into a more traditional Pareto/NBD (P/NBD) model.
1 Load and Configure Datasets
We first load some synthetic transaction data generated with a fixed transaction frequency but a varying lifetime per customer.
customer_cohort_tbl <- read_rds("data/synthdata_lifetime_cohort_tbl.rds")
customer_cohort_tbl %>% glimpse()
## Rows: 50,000
## Columns: 4
## $ customer_id <chr> "C201101_0001", "C201101_0002", "C201101_0003", "C20110…
## $ cohort_qtr <chr> "2011 Q1", "2011 Q1", "2011 Q1", "2011 Q1", "2011 Q1", …
## $ cohort_ym <chr> "2011 01", "2011 01", "2011 01", "2011 01", "2011 01", …
## $ first_tnx_date <date> 2011-01-01, 2011-01-01, 2011-01-01, 2011-01-01, 2011-0…
customer_simparams_tbl <- read_rds("data/synthdata_lifetime_simparams_tbl.rds")
customer_simparams_tbl %>% glimpse()
## Rows: 50,000
## Columns: 9
## $ customer_id <chr> "C201101_0001", "C201101_0002", "C201101_0003", "C2011…
## $ cohort_qtr <chr> "2011 Q1", "2011 Q1", "2011 Q1", "2011 Q1", "2011 Q1",…
## $ cohort_ym <chr> "2011 01", "2011 01", "2011 01", "2011 01", "2011 01",…
## $ first_tnx_date <date> 2011-01-01, 2011-01-01, 2011-01-01, 2011-01-01, 2011-…
## $ customer_mu <dbl> 0.087383599, 0.009023485, 0.113799661, 0.016847819, 0.…
## $ customer_tau <dbl> 13.4008628, 17.2527509, 0.9406138, 155.2154564, 61.892…
## $ customer_lambda <dbl> 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1,…
## $ customer_nu <dbl> 0.94528278, 1.76830099, 1.39246239, 0.36050074, 0.1393…
## $ customer_p <dbl> 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100,…
customer_transactions_tbl <- read_rds("data/synthdata_lifetime_transactions_tbl.rds")
customer_transactions_tbl %>% glimpse()
## Rows: 135,337
## Columns: 5
## $ customer_id <chr> "C201101_0001", "C201101_0002", "C201101_0002", "C201101…
## $ cohort_qtr <chr> "2011 Q1", "2011 Q1", "2011 Q1", "2011 Q1", "2011 Q1", "…
## $ cohort_ym <chr> "2011 01", "2011 01", "2011 01", "2011 01", "2011 01", "…
## $ tnx_timestamp <dttm> 2011-01-01 14:14:36, 2011-01-01 06:38:12, 2011-01-23 19…
## $ tnx_amount <dbl> 100.25, 52.76, 52.22, 80.54, 315.04, 289.44, 290.18, 309…
We also want to set up a number of parameters for use in this workbook.
stan_modeldir <- "stan_models"
stan_codedir <- "stan_code"
1.1 Calculate Summary Statistics
We also want to calculate the summary statistics for each of these customers based on an observation date of Jan 1 2020.
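The helper calculate_transaction_cbs_data() is defined elsewhere in the project, so a brief sketch may help: conceptually it reduces each customer's transaction history to the standard CBS ("customer-by-sufficient-statistic") quantities - the repeat-transaction count x, the recency t_x, and the total observation time T_cal, measured in weeks. The sketch below is hypothetical and may differ from the real helper in detail.

```r
# Hypothetical sketch of the CBS helper - the real function is defined
# elsewhere in the project and may differ in detail.
calculate_transaction_cbs_data_sketch <- function(tnx_tbl, last_date) {
  tnx_tbl %>%
    group_by(customer_id) %>%
    summarise(
      first_tnx_date = min(tnx_timestamp),
      last_tnx_date  = max(tnx_timestamp),
      x     = n() - 1,                                   # count of repeat transactions
      t_x   = difftime(last_tnx_date, first_tnx_date,
                       units = "weeks") %>% as.numeric(), # recency of last transaction
      T_cal = difftime(last_date, first_tnx_date,
                       units = "weeks") %>% as.numeric(), # total observation time
      .groups = "drop"
    )
}
```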
customer_summarystats_tbl <- customer_transactions_tbl %>%
calculate_transaction_cbs_data(
last_date = as.Date("2020-01-01")
)
customer_summarystats_tbl %>% glimpse()
## Rows: 50,000
## Columns: 6
## $ customer_id <chr> "C201101_0001", "C201101_0002", "C201101_0003", "C20110…
## $ first_tnx_date <dttm> 2011-01-01 14:14:36, 2011-01-01 06:38:12, 2011-01-01 1…
## $ last_tnx_date <dttm> 2011-01-01 14:14:36, 2011-01-23 19:21:11, 2011-01-01 1…
## $ x <dbl> 0, 1, 0, 19, 4, 2, 1, 0, 2, 2, 0, 0, 0, 1, 0, 0, 3, 0, …
## $ t_x <dbl> 0.000000, 3.218550, 0.000000, 151.009731, 23.709454, 23…
## $ T_cal <dbl> 469.4866, 469.5319, 469.5039, 469.4506, 469.5086, 469.5…
1.2 Visualise the Transaction Data
As always, we want to visualise this data as much as we can to get a feel for it, so we construct a plot of each customer.
tnx_plot_tbl <- customer_transactions_tbl %>%
group_nest(customer_id, .key = "tnx_data") %>%
filter(map_int(tnx_data, nrow) > 1) %>%
slice_sample(n = 50) %>%
unnest(tnx_data)
ggplot(tnx_plot_tbl, aes(x = tnx_timestamp, y = customer_id)) +
geom_line() +
geom_point() +
labs(
x = "Date",
y = "Customer ID",
title = "Visualisation of Transactions for Sample of Customers"
) +
theme(
axis.text.y = element_text(size = 8)
)
As part of the lifetime analysis we need to look at the times between successive transactions; we will also use these later in the analysis.
For the moment, we just want to see how the time differences work.
customer_timediffs_tbl <- customer_transactions_tbl %>%
group_by(customer_id) %>%
mutate(
tnx_timediff = difftime(
tnx_timestamp, lag(tnx_timestamp),
units = "weeks") %>%
as.numeric()
) %>%
ungroup()
ggplot(customer_timediffs_tbl %>% drop_na(tnx_timediff)) +
geom_histogram(aes(x = tnx_timediff), bins = 50) +
scale_y_continuous(labels = label_comma()) +
labs(
x = "Weeks",
y = "Count",
title = "Histogram of Weeks Between Successive Transactions"
)
We can see that almost all time differences are less than 75 weeks, but it is worth looking at a summary of the raw data.
customer_timediffs_tbl %>% pull(tnx_timediff) %>% summary()
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 1.86 4.53 6.84 9.32 85.67 50000
Considering this data, we can reasonably treat a customer as having 'died' once 100 weeks have passed since their most recent transaction, and we use this cutoff in our model.
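We can check this cutoff directly against the data: no observed gap between successive transactions reaches 100 weeks, since the maximum above is roughly 85.7 weeks.

```r
# Confirm that no inter-transaction gap reaches the 100-week cutoff
customer_timediffs_tbl %>%
  summarise(
    max_gap_weeks    = max(tnx_timediff, na.rm = TRUE),
    prop_over_cutoff = mean(tnx_timediff >= 100, na.rm = TRUE)
  )
```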
1.3 Construct the Model Data
We take an analogous approach for our lifetime model as we did for the frequency model: we take a subset of the data for fitting and use the rest to validate the model.
We also need to exclude all customers with just a single transaction on our system, as the first transaction is considered the 'birth' of the customer and it is the second and subsequent transactions that allow us to calculate the summary statistics.
break_date <- as.Date("2018-01-01")
# weeks from the start of the first cohort (2011-01-01) to the break date
full_weeks <- difftime(break_date, as.Date("2011-01-01"), units = "weeks") %>%
  as.numeric()
training_tbl <- customer_transactions_tbl %>%
filter(
tnx_timestamp <= break_date
) %>%
group_nest(customer_id, .key = "tnx_data") %>%
filter(map_int(tnx_data, nrow) > 1) %>%
slice_sample(n = 10000) %>%
unnest(tnx_data)
training_tbl %>% glimpse()
## Rows: 40,058
## Columns: 5
## $ customer_id <chr> "C201610_0076", "C201610_0076", "C201610_0076", "C201610…
## $ cohort_qtr <chr> "2016 Q4", "2016 Q4", "2016 Q4", "2016 Q4", "2015 Q1", "…
## $ cohort_ym <chr> "2016 10", "2016 10", "2016 10", "2016 10", "2015 02", "…
## $ tnx_timestamp <dttm> 2016-10-05 04:02:25, 2016-10-06 13:50:03, 2016-10-08 10…
## $ tnx_amount <dbl> 125.56, 109.38, 116.17, 84.56, 140.57, 181.75, 168.09, 1…
test_tbl <- customer_transactions_tbl %>%
semi_join(training_tbl, by = "customer_id") %>%
filter(
tnx_timestamp > break_date
)
test_tbl %>% glimpse()
## Rows: 2,694
## Columns: 5
## $ customer_id <chr> "C201103_0202", "C201103_0202", "C201103_0202", "C201103…
## $ cohort_qtr <chr> "2011 Q1", "2011 Q1", "2011 Q1", "2011 Q1", "2011 Q1", "…
## $ cohort_ym <chr> "2011 03", "2011 03", "2011 03", "2011 03", "2011 03", "…
## $ tnx_timestamp <dttm> 2018-03-08 23:43:06, 2018-04-27 18:58:50, 2018-09-12 16…
## $ tnx_amount <dbl> 57.28, 65.02, 71.91, 61.95, 60.90, 63.58, 67.77, 66.26, …
We need to reconstruct the summary statistics on the training window, this time adding censoring bounds for each customer's lifetime: the minimum is the observed recency t_x, and the maximum caps this at a further 100 weeks, truncated at the total observation time T_cal.
training_stats_tbl <- training_tbl %>%
calculate_transaction_cbs_data(
last_date = break_date
) %>%
mutate(
min_lifetime = t_x,
max_lifetime = pmin(t_x + 100, T_cal)
)
training_stats_tbl %>% glimpse()
## Rows: 10,000
## Columns: 8
## $ customer_id <chr> "C201101_0002", "C201101_0004", "C201101_0006", "C20110…
## $ first_tnx_date <dttm> 2011-01-01 06:38:12, 2011-01-01 20:17:36, 2011-01-01 0…
## $ last_tnx_date <dttm> 2011-01-23 19:21:11, 2013-11-23 21:55:42, 2011-06-11 0…
## $ x <dbl> 1, 19, 2, 2, 3, 3, 2, 2, 1, 2, 1, 4, 4, 1, 4, 1, 1, 3, …
## $ t_x <dbl> 3.2185499, 151.0097313, 23.0424788, 3.3377017, 21.64700…
## $ T_cal <dbl> 365.2462, 365.1649, 365.2819, 365.2011, 365.2057, 365.1…
## $ min_lifetime <dbl> 3.2185499, 151.0097313, 23.0424788, 3.3377017, 21.64700…
## $ max_lifetime <dbl> 103.2185, 251.0097, 123.0425, 103.3377, 121.6470, 104.4…
For model validation purposes, we just need to check for the presence or absence of customer transactions in the test set.
2 Construct Initial Lifetime Models
We now turn our attention to fitting our model to get estimates of the lifetime of the customer.
There is one important distinction between our lifetime model and the frequency model: we do not have direct observations of the lifetime of the customer.
Instead, what we have is an exact observation of each customer's minimum possible lifetime: the time between that customer's very first transaction and their most recently observed one.
We know that the customer’s lifetime is at least this number.
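In likelihood terms this makes each lifetime a censored observation of an exponential variable. Writing \(\tau_i \sim \text{Exponential}(\mu_i)\), a lower bound \(t_i^{\min}\) on the lifetime contributes the log survival function (the exponential_lccdf term in the Stan models below) and, where used, an upper bound \(t_i^{\max}\) contributes the log CDF (the exponential_lcdf term):

\[
\log S\left(t_i^{\min} \mid \mu_i\right) = -\mu_i \, t_i^{\min},
\qquad
\log F\left(t_i^{\max} \mid \mu_i\right) = \log\left(1 - e^{-\mu_i \, t_i^{\max}}\right).
\]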
2.1 Fit Flat Model
We start with a simple flat model without any explicit priors, as we did for the frequency models.
## data {
## int<lower=1> n; // count of observations
##
## vector<lower=0>[n] min_lifetime; // minimum observed lifetime for customer
## vector<lower=0>[n] max_lifetime; // maximum observed lifetime for customer
## vector<lower=0>[n] obs_time; // observation time since customer 'birth'
## }
##
## parameters {
## vector<lower=0>[n] mu;
## }
##
## model {
## target += exponential_lccdf(min_lifetime | mu);
## target += exponential_lcdf (max_lifetime | mu);
## }
##
## generated quantities {
## #include lifetime_genquan.stan
## }
We now compile this model and fit it with our observed data for customer lifetimes.
ltmodel_flat_stanmodel <- cmdstan_model(
"stan_code/ltmodel_flat.stan",
include_paths = stan_codedir,
pedantic = TRUE,
dir = stan_modeldir
)
We then use this compiled model with our data to produce the fit.
stan_modelname <- "ltmodel_flat"
stanfit_prefix <- str_c("fit_", stan_modelname)
stan_data_lst <- training_stats_tbl %>%
select(customer_id, obs_time = T_cal, min_lifetime, max_lifetime) %>%
compose_data()
ltmodel_flat_stanfit <- ltmodel_flat_stanmodel$sample(
data = stan_data_lst,
chains = 4,
iter_warmup = 500,
iter_sampling = 500,
seed = 4201,
output_dir = stan_modeldir,
output_basename = stanfit_prefix
)
## Running MCMC with 4 parallel chains...
##
## Chain 1 Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 2 Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 3 Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 4 Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 1 Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 4 Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 2 Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 3 Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 4 Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 1 Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 3 Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 2 Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 1 Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 4 Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 2 Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 3 Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 1 Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 4 Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 2 Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 3 Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 4 Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 4 Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 1 Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 1 Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 2 Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 2 Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 3 Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 3 Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 4 Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 1 Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 2 Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 3 Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 4 Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 2 Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 1 Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 3 Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 4 Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 2 Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 3 Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 1 Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 4 Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 1 Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 3 Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 2 Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 4 Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 4 finished in 77.0 seconds.
## Chain 2 Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 2 finished in 78.1 seconds.
## Chain 1 Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 1 finished in 78.3 seconds.
## Chain 3 Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 3 finished in 78.4 seconds.
##
## All 4 chains finished successfully.
## Mean chain execution time: 77.9 seconds.
## Total execution time: 78.8 seconds.
ltmodel_flat_stanfit$summary()
## # A tibble: 50,001 × 10
## variable mean median sd mad q5 q95 rhat ess_bulk
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 lp__ -3.73e+4 -3.73e+4 6.85e+1 6.88e+1 -3.74e+4 -3.72e+4 1.00 657.
## 2 mu[1] 3.19e-1 2.24e-1 3.07e-1 2.26e-1 2.53e-2 9.40e-1 1.00 2648.
## 3 mu[2] 9.11e-3 7.34e-3 7.19e-3 5.84e-3 1.27e-3 2.33e-2 1.00 2236.
## 4 mu[3] 4.77e-2 3.54e-2 4.18e-2 3.03e-2 5.94e-3 1.31e-1 1.00 2619.
## 5 mu[4] 3.14e-1 2.22e-1 3.07e-1 2.24e-1 2.13e-2 9.19e-1 1.00 2370.
## 6 mu[5] 5.28e-2 4.00e-2 4.52e-2 3.47e-2 6.48e-3 1.39e-1 1.00 2579.
## 7 mu[6] 2.31e-1 1.60e-1 2.24e-1 1.51e-1 2.03e-2 6.84e-1 1.00 2089.
## 8 mu[7] 8.50e-2 6.09e-2 7.78e-2 5.47e-2 9.15e-3 2.38e-1 0.999 2522.
## 9 mu[8] 8.11e-2 6.01e-2 7.38e-2 5.48e-2 9.65e-3 2.23e-1 1.00 2572.
## 10 mu[9] 5.83e-1 4.06e-1 5.94e-1 4.19e-1 3.99e-2 1.74e+0 1.00 1639.
## # … with 49,991 more rows, and 1 more variable: ess_tail <dbl>
We suspect this model has not fit properly, but we check the HMC diagnostics anyway.
ltmodel_flat_stanfit$cmdstan_diagnose()
## Processing csv files: /home/rstudio/workshop/stan_models/fit_ltmodel_flat-1.csv, /home/rstudio/workshop/stan_models/fit_ltmodel_flat-2.csv, /home/rstudio/workshop/stan_models/fit_ltmodel_flat-3.csv, /home/rstudio/workshop/stan_models/fit_ltmodel_flat-4.csv
##
## Checking sampler transitions treedepth.
## Treedepth satisfactory for all transitions.
##
## Checking sampler transitions for divergences.
## No divergent transitions found.
##
## Checking E-BFMI - sampler transitions HMC potential energy.
## E-BFMI satisfactory.
##
## Effective sample size satisfactory.
##
## Split R-hat values satisfactory all parameters.
##
## Processing complete, no problems detected.
Although the diagnostics report no problems, this model is still unsatisfactory: part of the problem is that we have an implicit uniform prior on \(\mu\), which does not match our knowledge.
2.2 Fit Fixed-Prior Minimum Time Model
We add a Gamma prior to this model, setting its parameters by passing them as data into the sampler. Since we have less confidence in our maximum-lifetime observation, as a first step we try fitting the model using only the minimum observed lifetime per customer.
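Before fitting, it is worth checking what this prior implies. With \(s = 1\) and \(\beta = 10\) (the values we pass in below), the Gamma prior on \(\mu\) reduces to an Exponential(10) distribution with mean 0.1, so the implied mean lifetime \(1/\mu\) is typically on the order of tens of weeks. A quick check of the prior quantiles:

```r
# Prior on mu: Gamma(shape = s, rate = beta) with s = 1, beta = 10
qgamma(c(0.05, 0.50, 0.95), shape = 1, rate = 10)

# Corresponding implied mean lifetimes 1 / mu, in weeks
1 / qgamma(c(0.95, 0.50, 0.05), shape = 1, rate = 10)
```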
## data {
## int<lower=1> n; // count of observations
##
## vector<lower=0>[n] min_lifetime; // minimum observed lifetime for customer
## vector<lower=0>[n] obs_time; // observation time since customer 'birth'
##
## real s;
## real beta;
## }
##
## parameters {
## vector<lower=0>[n] mu;
## }
##
## model {
## mu ~ gamma(s, beta);
##
## target += exponential_lccdf(min_lifetime | mu);
## }
##
## generated quantities {
## #include lifetime_genquan.stan
## }
As always, we then compile our Stan model and fit it to the data.
ltmodel_mintime_stanmodel <- cmdstan_model(
"stan_code/ltmodel_mintime.stan",
include_paths = stan_codedir,
pedantic = TRUE,
dir = stan_modeldir
)
We then use this compiled model with our data to produce the fit.
stan_modelname <- "ltmodel_mintime"
stanfit_prefix <- str_c("fit_", stan_modelname)
stan_data_lst <- training_stats_tbl %>%
select(customer_id, obs_time = T_cal, min_lifetime, max_lifetime) %>%
compose_data(
s = 1,
beta = 10
)
ltmodel_mintime_stanfit <- ltmodel_mintime_stanmodel$sample(
data = stan_data_lst,
chains = 4,
iter_warmup = 500,
iter_sampling = 500,
seed = 4202,
output_dir = stan_modeldir,
output_basename = stanfit_prefix
)
## Running MCMC with 4 parallel chains...
##
## Chain 1 Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 2 Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 3 Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 4 Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 2 Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 3 Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 1 Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 4 Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 2 Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 4 Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 3 Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 1 Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 2 Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 4 Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 3 Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 1 Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 4 Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 2 Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 3 Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 1 Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 4 Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 4 Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 2 Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 2 Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 3 Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 1 Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 1 Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 3 Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 4 Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 1 Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 2 Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 3 Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 4 Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 1 Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 2 Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 4 Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 3 Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 1 Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 2 Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 4 Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 1 Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 3 Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 4 Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 4 finished in 75.8 seconds.
## Chain 1 Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 2 Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 1 finished in 77.7 seconds.
## Chain 3 Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 2 Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 2 finished in 84.2 seconds.
## Chain 3 Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 3 finished in 88.8 seconds.
##
## All 4 chains finished successfully.
## Mean chain execution time: 81.6 seconds.
## Total execution time: 89.5 seconds.
ltmodel_mintime_stanfit$summary()
## # A tibble: 50,001 × 10
## variable mean median sd mad q5 q95 rhat ess_bulk
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 lp__ -4.72e+4 -4.72e+4 7.85e+1 7.83e+1 -4.74e+4 -4.71e+4 1.00 725.
## 2 mu[1] 7.53e-2 5.23e-2 7.38e-2 5.19e-2 4.69e-3 2.20e-1 1.00 1782.
## 3 mu[2] 6.67e-3 4.45e-3 6.75e-3 4.72e-3 2.58e-4 2.01e-2 1.00 1989.
## 4 mu[3] 3.03e-2 2.12e-2 3.06e-2 2.19e-2 1.29e-3 8.87e-2 1.00 1923.
## 5 mu[4] 7.79e-2 5.31e-2 7.97e-2 5.49e-2 3.65e-3 2.33e-1 1.00 1335.
## 6 mu[5] 3.12e-2 2.17e-2 3.13e-2 2.21e-2 1.47e-3 9.56e-2 1.00 1640.
## 7 mu[6] 6.59e-2 4.27e-2 6.72e-2 4.55e-2 3.60e-3 2.05e-1 1.00 1525.
## 8 mu[7] 4.44e-2 3.27e-2 4.16e-2 3.25e-2 2.89e-3 1.27e-1 1.00 1730.
## 9 mu[8] 4.20e-2 2.92e-2 4.17e-2 3.03e-2 2.19e-3 1.25e-1 1.00 1747.
## 10 mu[9] 8.48e-2 5.96e-2 8.31e-2 6.00e-2 4.02e-3 2.46e-1 1.00 2024.
## # … with 49,991 more rows, and 1 more variable: ess_tail <dbl>
It appears that this fit is much less problematic, so we check the HMC diagnostics.
ltmodel_mintime_stanfit$cmdstan_diagnose()
## Processing csv files: /home/rstudio/workshop/stan_models/fit_ltmodel_mintime-1.csv, /home/rstudio/workshop/stan_models/fit_ltmodel_mintime-2.csv, /home/rstudio/workshop/stan_models/fit_ltmodel_mintime-3.csv, /home/rstudio/workshop/stan_models/fit_ltmodel_mintime-4.csv
##
## Checking sampler transitions treedepth.
## Treedepth satisfactory for all transitions.
##
## Checking sampler transitions for divergences.
## No divergent transitions found.
##
## Checking E-BFMI - sampler transitions HMC potential energy.
## E-BFMI satisfactory.
##
## Effective sample size satisfactory.
##
## Split R-hat values satisfactory all parameters.
##
## Processing complete, no problems detected.
We should also check some of the trace plots:
ltmodel_mintime_stanfit$draws() %>%
mcmc_trace(
pars = c("mu[1]", "mu[2]", "mu[3]", "mu[4]", "mu[5]", "mu[6]")
) +
ggtitle("Traceplots of Some Mu Values")
2.2.1 Investigate Posterior Distribution of Customer Lifetime
We now want to visualise the posterior distribution of the expected lifetime of the customer.
ltmodel_mintime_validation_tbl <- ltmodel_mintime_stanfit %>%
recover_types(training_stats_tbl) %>%
spread_draws(mu[customer_id], tau[customer_id], p_alive[customer_id]) %>%
ungroup() %>%
inner_join(customer_simparams_tbl, by = "customer_id") %>%
select(
customer_id, first_tnx_date, draw_id = .draw,
post_mu = mu, post_tau_mean = tau, post_p_alive = p_alive,
customer_mu, customer_tau
)
ltmodel_mintime_validation_tbl %>% glimpse()
## Rows: 20,000,000
## Columns: 8
## $ customer_id <chr> "C201101_0002", "C201101_0002", "C201101_0002", "C20110…
## $ first_tnx_date <date> 2011-01-01, 2011-01-01, 2011-01-01, 2011-01-01, 2011-0…
## $ draw_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, …
## $ post_mu <dbl> 0.19913000, 0.15009000, 0.09672760, 0.05766700, 0.06708…
## $ post_tau_mean <dbl> 5.02186, 6.66268, 10.33830, 17.34100, 14.90580, 300.392…
## $ post_p_alive <dbl> 0.00000e+00, 0.00000e+00, 6.06284e-16, 8.57473e-10, 2.8…
## $ customer_mu <dbl> 0.009023485, 0.009023485, 0.009023485, 0.009023485, 0.0…
## $ customer_tau <dbl> 17.25275, 17.25275, 17.25275, 17.25275, 17.25275, 17.25…
We now use our existing routine to compare the posterior values for mu with
the predetermined value.
ltmodel_mintime_mu_qvalues_tbl <- ltmodel_mintime_validation_tbl %>%
calculate_distribution_qvals(post_mu, customer_mu, customer_id, first_tnx_date)
ref_value <- ltmodel_mintime_mu_qvalues_tbl %>%
nrow() %>%
divide_by(50)
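The helper calculate_distribution_qvals() is also defined elsewhere; conceptually, for each customer it computes where the known simulation value sits within the posterior draws. If the model is well calibrated these q-values are approximately uniform on \([0, 1]\), which is why ref_value, the expected count per histogram bin under uniformity, is drawn as a red reference line. A hypothetical sketch of the computation:

```r
# Hypothetical sketch of the q-value calculation - the project helper
# may differ in detail.
ltmodel_mintime_mu_qvalues_sketch_tbl <- ltmodel_mintime_validation_tbl %>%
  group_by(customer_id, first_tnx_date, customer_mu) %>%
  summarise(
    # fraction of posterior draws at or below the known simulation value
    q_val   = mean(post_mu <= customer_mu),
    .groups = "drop"
  )
```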
ggplot(ltmodel_mintime_mu_qvalues_tbl) +
geom_histogram(aes(x = q_val), bins = 50) +
geom_hline(aes(yintercept = ref_value), colour = "red") +
labs(
x = "Quantile",
y = "Count",
title = "Quantile Plot of the q-Values for the Mu Posterior Distribution (mintime Model)"
)
ggplot(ltmodel_mintime_mu_qvalues_tbl) +
geom_point(aes(x = q_val, y = customer_mu), size = 1, alpha = 0.05) +
labs(
x = "Quantile",
y = "Customer Mu",
title = "Scatterplot of q-Value against Mu (mintime Model)"
)
We repeat this process for the \(\tau\) value to see the effect on the average lifetime.
ltmodel_mintime_tau_qvalues_tbl <- ltmodel_mintime_validation_tbl %>%
calculate_distribution_qvals(post_tau_mean, customer_tau, customer_id, first_tnx_date)
ref_value <- ltmodel_mintime_tau_qvalues_tbl %>%
nrow() %>%
divide_by(50)
ggplot(ltmodel_mintime_tau_qvalues_tbl) +
geom_histogram(aes(x = q_val), bins = 50) +
geom_hline(aes(yintercept = ref_value), colour = "red") +
labs(
x = "Quantile",
y = "Count",
title = "Quantile Plot of the q-Values for the Tau Posterior Distribution (mintime Model)"
)
ggplot(ltmodel_mintime_tau_qvalues_tbl) +
geom_point(aes(x = q_val, y = customer_tau)) +
labs(
x = "Quantile",
y = "Customer Tau",
title = "Scatterplot of q-Value against Tau (mintime Model)"
)
It would appear that recovering the parameters of the lifetime model is more difficult than for the frequency model. This is not surprising: our measurement of customer lifetime is much noisier, since all we can observe is a minimum value for that lifetime, so it follows that our estimates come with much wider uncertainty intervals.
Our issue is not just the precision of the estimates, though: we also appear to have a bias problem, introduced by the noisiness of our lifetime observations.
This needs further investigation, so we calculate some summary statistics for these parameters and plot them against our known 'true' values of these quantities.
ltmodel_mintime_postsummary_tbl <- ltmodel_mintime_validation_tbl %>%
group_by(customer_id) %>%
summarise(
.groups = "drop",
mu_p10 = quantile(post_mu, 0.10),
mu_p25 = quantile(post_mu, 0.25),
mu_median = median (post_mu),
mu_mean = mean (post_mu),
mu_p75 = quantile(post_mu, 0.75),
mu_p90 = quantile(post_mu, 0.90),
tau_p10 = quantile(post_tau_mean, 0.10),
tau_p25 = quantile(post_tau_mean, 0.25),
tau_median = median (post_tau_mean),
tau_mean = mean (post_tau_mean),
tau_p75 = quantile(post_tau_mean, 0.75),
tau_p90 = quantile(post_tau_mean, 0.90),
customer_mu = unique(customer_mu),
customer_tau = unique(customer_tau)
)
ltmodel_mintime_postsummary_tbl %>% glimpse()
## Rows: 10,000
## Columns: 15
## $ customer_id <chr> "C201101_0002", "C201101_0004", "C201101_0006", "C201101_…
## $ mu_p10 <dbl> 0.0092106810, 0.0006478439, 0.0030852110, 0.0078394130, 0…
## $ mu_p25 <dbl> 0.023857900, 0.001869510, 0.008761672, 0.022462875, 0.009…
## $ mu_median <dbl> 0.052294450, 0.004446245, 0.021176750, 0.053145600, 0.021…
## $ mu_mean <dbl> 0.075263001, 0.006667289, 0.030261938, 0.077919382, 0.031…
## $ mu_p75 <dbl> 0.102917750, 0.009228000, 0.042662625, 0.108517500, 0.042…
## $ mu_p90 <dbl> 0.168895000, 0.015422570, 0.068355390, 0.177157300, 0.072…
## $ tau_p10 <dbl> 5.920841, 64.840050, 14.629450, 5.644710, 13.827150, 6.41…
## $ tau_p25 <dbl> 9.716540, 108.365750, 23.439700, 9.215087, 23.797925, 10.…
## $ tau_median <dbl> 19.12250, 224.90900, 47.22160, 18.81625, 46.00295, 23.434…
## $ tau_mean <dbl> 75.51625, 1070.87630, 1440.37341, 394.15380, 295.87336, 8…
## $ tau_p75 <dbl> 41.91505, 534.90125, 114.13375, 44.51793, 108.24575, 53.6…
## $ tau_p90 <dbl> 108.5700, 1543.5870, 324.1269, 127.5604, 305.6213, 139.33…
## $ customer_mu <dbl> 0.009023485, 0.016847819, 0.048667792, 0.141250317, 0.058…
## $ customer_tau <dbl> 17.252751, 155.215456, 28.310863, 9.680664, 36.956593, 4.…
We can now visualise these summary statistics.
plot_tbl <- ltmodel_mintime_postsummary_tbl %>%
slice_sample(n = 25) %>%
arrange(customer_id)
ggplot(plot_tbl) +
geom_errorbar(
aes(x = customer_id, ymin = mu_p10, ymax = mu_p90),
width = 0, size = 1) +
geom_errorbar(
aes(x = customer_id, ymin = mu_p25, ymax = mu_p75),
width = 0, size = 4) +
geom_point(
aes(x = customer_id, y = customer_mu),
colour = "red", size = 2.5) +
labs(
x = "Customer ID",
y = "Posterior Mu",
title = "Summary Statistics of Mu Values"
) +
theme(axis.text.x = element_text(angle = 30, vjust = 0.5, size = 8))
ggplot(plot_tbl) +
geom_errorbar(
aes(x = customer_id, ymin = tau_p10, ymax = tau_p90),
width = 0, size = 1) +
geom_errorbar(
aes(x = customer_id, ymin = tau_p25, ymax = tau_p75),
width = 0, size = 4) +
geom_point(
aes(x = customer_id, y = customer_tau),
colour = "red", size = 2.5) +
scale_y_log10(labels = label_comma()) +
labs(
x = "Customer ID",
y = "Posterior Tau",
title = "Summary Statistics of Tau Values"
) +
theme(axis.text.x = element_text(angle = 30, vjust = 0.5, size = 8))
2.2.2 Investigate q_value Cohorts
The scatterplot of customer_mu against q_value does not look correct: the q-values should be independent of the value of customer_mu, so we need to look into this further.
To start with, we look at the lower end of the q_value distribution.
plot_tbl <- ltmodel_mintime_mu_qvalues_tbl %>%
filter(q_val < 0.1) %>%
select(customer_id) %>%
inner_join(ltmodel_mintime_postsummary_tbl, by = "customer_id")
ggplot(plot_tbl) +
geom_errorbar(
aes(x = customer_id, ymin = mu_p10, ymax = mu_p90),
size = 1, width = 0) +
geom_errorbar(
aes(x = customer_id, ymin = mu_p25, ymax = mu_p75),
size = 3, width = 0) +
geom_point(
aes(x = customer_id, y = customer_mu),
colour = "red") +
labs(
x = "Customer ID",
y = "Lifetime Mu",
title = "Posterior Estimates of Small q-Value Customers") +
theme(axis.text.x = element_text(angle = 30, vjust = 0.5, size = 8))
We have been looking at customer_mu here, but it may be more informative to look at customer_tau instead.
plot_tbl <- ltmodel_mintime_mu_qvalues_tbl %>%
filter(q_val < 0.1) %>%
select(customer_id) %>%
inner_join(ltmodel_mintime_postsummary_tbl, by = "customer_id")
ggplot(plot_tbl) +
geom_errorbar(
aes(x = customer_id, ymin = tau_p10, ymax = tau_p90),
size = 1, width = 0) +
geom_errorbar(
aes(x = customer_id, ymin = tau_p25, ymax = tau_p75),
size = 3, width = 0) +
geom_point(
aes(x = customer_id, y = customer_tau),
colour = "red") +
labs(
x = "Customer ID",
y = "Lifetime Tau",
title = "Posterior Estimates of Small q-Value Customers") +
theme(axis.text.x = element_text(angle = 30, vjust = 0.5, size = 8))
A few outliers are skewing this plot, so we use a log scale on the \(y\)-axis.
ggplot(plot_tbl) +
geom_errorbar(
aes(x = customer_id, ymin = tau_p10, ymax = tau_p90),
size = 1, width = 0) +
geom_errorbar(
aes(x = customer_id, ymin = tau_p25, ymax = tau_p75),
size = 3, width = 0) +
geom_point(
aes(x = customer_id, y = customer_tau),
colour = "red") +
scale_y_log10(labels = label_comma()) +
labs(
x = "Customer ID",
y = "Lifetime Tau",
title = "Posterior Estimates of Small q-Value Customers") +
theme(axis.text.x = element_text(angle = 30, vjust = 0.5, size = 8))
It seems that having only censored observations of the lifetime impairs the model's ability to recover the parameters.
2.2.3 Validate p_alive Metric
A final validation approach is to take the average value for p_alive for
each customer, group all these values into equally-spaced buckets and then
check the proportion for each of these buckets against the proportion of
customers in that bucket that made a subsequent transaction.
This gives us a visual validation of the proportions.
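We cannot see lifetime_genquan.stan here, but for the exponential lifetime model p_alive plausibly follows from the memoryless property: given that a customer survived at least to their recency \(t_x\), the probability they are still alive at the observation time \(T\) is

\[
p_{\text{alive}} = P\left(\tau > T \mid \tau > t_x\right) = \exp\left(-\mu \left(T - t_x\right)\right),
\]

which is consistent with the very small post_p_alive values we saw earlier for long-inactive customers.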
ltmodel_mintime_alive_tbl <- ltmodel_mintime_validation_tbl %>%
group_by(customer_id) %>%
summarise(
.groups = "drop",
mean_p_alive = mean(post_p_alive)
) %>%
mutate(
p_alive_decile = cut(
mean_p_alive,
breaks = seq(0, 1, by = 0.05),
labels = sprintf("PALIVE_%03d", 5 * 1:20)
)
) %>%
left_join(
test_tbl %>% count(customer_id, name = "tnx_count"),
by = "customer_id"
) %>%
replace_na(
list(tnx_count = 0)
) %>%
mutate(still_alive = if_else(tnx_count > 0, "ACTIVE", "INACTIVE"))
ltmodel_mintime_alive_tbl %>% glimpse()
## Rows: 10,000
## Columns: 5
## $ customer_id <chr> "C201101_0002", "C201101_0004", "C201101_0006", "C20110…
## $ mean_p_alive <dbl> 0.02978512, 0.41704520, 0.09121881, 0.03694856, 0.08673…
## $ p_alive_decile <fct> PALIVE_005, PALIVE_045, PALIVE_010, PALIVE_005, PALIVE_…
## $ tnx_count <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ still_alive <chr> "INACTIVE", "INACTIVE", "INACTIVE", "INACTIVE", "INACTI…
Having constructed the dataset, we then construct the visualisation of the proportions.
ltmodel_mintime_alive_valid_tbl <- ltmodel_mintime_alive_tbl %>%
count(p_alive_decile, still_alive) %>%
pivot_wider(
id_cols = p_alive_decile,
names_from = still_alive,
values_from = n,
values_fill = 0
) %>%
mutate(
top_prop = str_replace(p_alive_decile, "PALIVE_(\\d+)", "\\1") %>%
as.numeric() %>%
divide_by(100),
active_prop = ACTIVE / (ACTIVE + INACTIVE)
)
ggplot(ltmodel_mintime_alive_valid_tbl) +
geom_point(aes(x = top_prop, y = active_prop)) +
geom_line(aes(x = top_prop, y = top_prop), colour = "red") +
labs(
x = "Bucket Proportion",
y = "Customer Proportion",
title = "Comparison Plot of p_alive Against Active Proportion"
) +
expand_limits(x = 0)
As before, it looks like we are overstating the probability of a customer being alive: the proportion of customers actually transacting is lower than the value of p_alive suggests. Our customers are going inactive at a higher rate than we expect, which gives an upward bias to our estimates of customer lifetime.
We continue for now though, and see if additions to the model can improve things.
ltmodel_mintime_stanfit %>% write_rds("data/ltmodel_mintime_stanfit.rds")
2.3 Fit Fixed-Prior Min/Max Time Model
It is clear that we need to improve our approach for estimating customer lifetimes. We try including the maximum time in the model.
## data {
## int<lower=1> n; // count of observations
##
## vector<lower=0>[n] min_lifetime; // minimum observed lifetime for customer
## vector<lower=0>[n] max_lifetime; // maximum observed lifetime for customer
## vector<lower=0>[n] obs_time; // observation time since customer 'birth'
##
## real s;
## real beta;
## }
##
## parameters {
## vector<lower=0>[n] mu;
## }
##
## model {
## mu ~ gamma(s, beta);
##
## target += exponential_lccdf(min_lifetime | mu);
## target += exponential_lcdf (max_lifetime | mu);
## }
##
## generated quantities {
## #include lifetime_genquan.stan
## }
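Written out with the exponential survival function \(S(t) = e^{-\mu t}\), the two target increments require each customer's unobserved lifetime \(\tau_i\) to exceed the minimum observed lifetime and to fall below the maximum:

\[
\log P(\tau_i > t^{\min}_i) + \log P(\tau_i \le t^{\max}_i)
  = -\mu_i \, t^{\min}_i + \log\!\left(1 - e^{-\mu_i \, t^{\max}_i}\right),
\]

which is exactly what the exponential_lccdf and exponential_lcdf statements contribute to the log-posterior.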
As always, we then compile our Stan model and fit it to the data.
ltmodel_minmax_stanmodel <- cmdstan_model(
"stan_code/ltmodel_minmax.stan",
include_paths = stan_codedir,
pedantic = TRUE,
dir = stan_modeldir
)
With the model compiled, we fit it to the same data as before.
stan_modelname <- "ltmodel_minmax"
stanfit_prefix <- str_c("fit_", stan_modelname)
stan_data_lst <- training_stats_tbl %>%
select(customer_id, obs_time = T_cal, min_lifetime, max_lifetime) %>%
compose_data(
s = 1,
beta = 10
)
ltmodel_minmax_stanfit <- ltmodel_minmax_stanmodel$sample(
data = stan_data_lst,
chains = 4,
iter_warmup = 500,
iter_sampling = 500,
seed = 4203,
output_dir = stan_modeldir,
output_basename = stanfit_prefix
)## Running MCMC with 4 parallel chains...
##
## Chain 1 Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 2 Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 3 Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 4 Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 1 Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 2 Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 3 Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 4 Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 1 Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 2 Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 3 Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 4 Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 2 Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 3 Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 1 Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 4 Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 3 Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 2 Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 4 Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 1 Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 3 Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 2 Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 3 Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 2 Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 4 Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 4 Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 1 Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 1 Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 3 Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 2 Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 4 Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 1 Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 3 Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 2 Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 4 Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 1 Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 3 Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 2 Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 4 Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 1 Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 3 Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 2 Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 4 Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 1 Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 3 Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 3 finished in 83.4 seconds.
## Chain 2 Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 4 Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 2 finished in 84.5 seconds.
## Chain 4 finished in 83.9 seconds.
## Chain 1 Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 1 finished in 85.1 seconds.
##
## All 4 chains finished successfully.
## Mean chain execution time: 84.2 seconds.
## Total execution time: 85.5 seconds.
ltmodel_minmax_stanfit$summary()## # A tibble: 50,001 × 10
## variable mean median sd mad q5 q95 rhat ess_bulk
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 lp__ -4.73e+4 -4.73e+4 7.37e+1 7.15e+1 -4.74e+4 -4.72e+4 1.00 750.
## 2 mu[1] 8.39e-2 6.12e-2 7.55e-2 5.24e-2 9.48e-3 2.37e-1 0.999 2591.
## 3 mu[2] 8.59e-3 6.90e-3 6.41e-3 5.09e-3 1.52e-3 2.10e-2 1.00 3156.
## 4 mu[3] 3.69e-2 2.82e-2 3.18e-2 2.25e-2 4.86e-3 1.03e-1 1.00 3906.
## 5 mu[4] 8.18e-2 5.91e-2 7.26e-2 5.36e-2 9.63e-3 2.32e-1 1.00 3202.
## 6 mu[5] 3.72e-2 2.87e-2 3.00e-2 2.35e-2 6.00e-3 9.68e-2 1.00 2625.
## 7 mu[6] 7.80e-2 5.69e-2 6.89e-2 5.15e-2 9.15e-3 2.13e-1 0.999 3407.
## 8 mu[7] 5.35e-2 3.85e-2 4.82e-2 3.33e-2 6.85e-3 1.52e-1 1.00 2795.
## 9 mu[8] 5.03e-2 3.73e-2 4.40e-2 3.28e-2 6.18e-3 1.34e-1 0.999 2811.
## 10 mu[9] 9.59e-2 6.77e-2 8.82e-2 6.11e-2 1.10e-2 2.75e-1 0.999 2629.
## # … with 49,991 more rows, and 1 more variable: ess_tail <dbl>
As before, we need to check the HMC diagnostics.
ltmodel_minmax_stanfit$cmdstan_diagnose()## Processing csv files: /home/rstudio/workshop/stan_models/fit_ltmodel_minmax-1.csv, /home/rstudio/workshop/stan_models/fit_ltmodel_minmax-2.csv, /home/rstudio/workshop/stan_models/fit_ltmodel_minmax-3.csv, /home/rstudio/workshop/stan_models/fit_ltmodel_minmax-4.csv
##
## Checking sampler transitions treedepth.
## Treedepth satisfactory for all transitions.
##
## Checking sampler transitions for divergences.
## No divergent transitions found.
##
## Checking E-BFMI - sampler transitions HMC potential energy.
## E-BFMI satisfactory.
##
## Effective sample size satisfactory.
##
## Split R-hat values satisfactory all parameters.
##
## Processing complete, no problems detected.
2.3.1 Plot Customer Lifetime Diagnostic Plots
We now want to visualise the posterior distribution of the expected lifetime of the customer.
ltmodel_minmax_validation_tbl <- ltmodel_minmax_stanfit %>%
recover_types(training_stats_tbl) %>%
spread_draws(mu[customer_id], tau[customer_id], p_alive[customer_id]) %>%
ungroup() %>%
inner_join(customer_simparams_tbl, by = "customer_id") %>%
select(
customer_id, first_tnx_date, draw_id = .draw,
post_mu = mu, post_tau_mean = tau, post_p_alive = p_alive,
customer_mu, customer_tau
)
ltmodel_minmax_validation_tbl %>% glimpse()## Rows: 20,000,000
## Columns: 8
## $ customer_id <chr> "C201101_0002", "C201101_0002", "C201101_0002", "C20110…
## $ first_tnx_date <date> 2011-01-01, 2011-01-01, 2011-01-01, 2011-01-01, 2011-0…
## $ draw_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, …
## $ post_mu <dbl> 0.47066600, 0.07995400, 0.07683270, 0.08568460, 0.01306…
## $ post_tau_mean <dbl> 2.12465, 12.50720, 13.01530, 11.67070, 76.52430, 4.1747…
## $ post_p_alive <dbl> 0.00000e+00, 2.68542e-13, 8.31550e-13, 3.37903e-14, 8.8…
## $ customer_mu <dbl> 0.009023485, 0.009023485, 0.009023485, 0.009023485, 0.0…
## $ customer_tau <dbl> 17.25275, 17.25275, 17.25275, 17.25275, 17.25275, 17.25…
We now use our existing routine to compare the posterior values for mu with
the predetermined value.
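The helper calculate_distribution_qvals is defined earlier in the workbook; assuming it computes, for each customer, the fraction of posterior draws at or below the known simulation parameter, the q-value over \(M\) draws is

\[
q_i = \frac{1}{M} \sum_{m=1}^{M} \mathbb{1}\!\left[\mu_i^{(m)} \le \mu_i^{\mathrm{sim}}\right].
\]

If the posteriors are well calibrated, the \(q_i\) are approximately Uniform(0, 1), so each of the 50 histogram bins should hold roughly \(n / 50\) customers — the level marked by the red ref_value line below.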
ltmodel_minmax_mu_qvalues_tbl <- ltmodel_minmax_validation_tbl %>%
calculate_distribution_qvals(post_mu, customer_mu, customer_id, first_tnx_date)
ref_value <- ltmodel_minmax_mu_qvalues_tbl %>%
nrow() %>%
divide_by(50)
ggplot(ltmodel_minmax_mu_qvalues_tbl) +
geom_histogram(aes(x = q_val), bins = 50) +
geom_hline(aes(yintercept = ref_value), colour = "red") +
labs(
x = "Quantile",
y = "Count",
title = "Quantile Plot of the q-Values for the Posterior Distribution"
)ggplot(ltmodel_minmax_mu_qvalues_tbl) +
geom_point(aes(x = q_val, y = customer_mu)) +
labs(
x = "Quantile",
y = "Customer Mu",
title = "Scatterplot of q-Value against Mu"
)
We repeat this process for the \(\tau\) values to see the effect on the average lifetime.
ltmodel_minmax_tau_qvalues_tbl <- ltmodel_minmax_validation_tbl %>%
calculate_distribution_qvals(post_tau_mean, customer_tau, customer_id, first_tnx_date)
ref_value <- ltmodel_minmax_tau_qvalues_tbl %>%
nrow() %>%
divide_by(50)
ggplot(ltmodel_minmax_tau_qvalues_tbl) +
geom_histogram(aes(x = q_val), bins = 50) +
geom_hline(aes(yintercept = ref_value), colour = "red") +
labs(
x = "Quantile",
y = "Count",
title = "Quantile Plot of the q-Values for the Tau Posterior Distribution (minmax Model)"
)ggplot(ltmodel_minmax_tau_qvalues_tbl) +
geom_point(aes(x = q_val, y = customer_tau)) +
labs(
x = "Quantile",
y = "Customer Tau",
title = "Scatterplot of q-Value against Tau (minmax Model)"
)
2.3.2 Calculate minmax Model Posterior Summary Statistics
We now calculate some summary statistics of the posterior quantities.
ltmodel_minmax_postsummary_tbl <- ltmodel_minmax_validation_tbl %>%
group_by(customer_id) %>%
summarise(
.groups = "drop",
mu_p10 = quantile(post_mu, 0.10),
mu_p25 = quantile(post_mu, 0.25),
mu_median = median (post_mu),
mu_mean = mean (post_mu),
mu_p75 = quantile(post_mu, 0.75),
mu_p90 = quantile(post_mu, 0.90),
tau_p10 = quantile(post_tau_mean, 0.10),
tau_p25 = quantile(post_tau_mean, 0.25),
tau_median = median (post_tau_mean),
tau_mean = mean (post_tau_mean),
tau_p75 = quantile(post_tau_mean, 0.75),
tau_p90 = quantile(post_tau_mean, 0.90),
customer_mu = unique(customer_mu),
customer_tau = unique(customer_tau)
)
ltmodel_minmax_postsummary_tbl %>% glimpse()## Rows: 10,000
## Columns: 15
## $ customer_id <chr> "C201101_0002", "C201101_0004", "C201101_0006", "C201101_…
## $ mu_p10 <dbl> 0.015611020, 0.002247156, 0.007969247, 0.015621790, 0.008…
## $ mu_p25 <dbl> 0.030674475, 0.003973943, 0.015086350, 0.029101200, 0.015…
## $ mu_median <dbl> 0.061161550, 0.006903310, 0.028158650, 0.059064800, 0.028…
## $ mu_mean <dbl> 0.083906509, 0.008591510, 0.036924538, 0.081846056, 0.037…
## $ mu_p75 <dbl> 0.11321400, 0.01155767, 0.04844448, 0.11305950, 0.0511908…
## $ mu_p90 <dbl> 0.17922390, 0.01706617, 0.07777854, 0.17814330, 0.0761281…
## $ tau_p10 <dbl> 5.579605, 58.595380, 12.857000, 5.613448, 13.135870, 5.75…
## $ tau_p25 <dbl> 8.832825, 86.522675, 20.642150, 8.844893, 19.534750, 9.23…
## $ tau_median <dbl> 16.35015, 144.85800, 35.51305, 16.93060, 34.84010, 17.572…
## $ tau_mean <dbl> 34.47874, 223.04430, 62.52966, 31.97575, 63.05222, 33.679…
## $ tau_p75 <dbl> 32.60040, 251.63950, 66.28508, 34.36285, 64.33345, 35.164…
## $ tau_p90 <dbl> 64.05778, 445.00800, 125.48650, 64.01318, 119.72810, 70.3…
## $ customer_mu <dbl> 0.009023485, 0.016847819, 0.048667792, 0.141250317, 0.058…
## $ customer_tau <dbl> 17.252751, 155.215456, 28.310863, 9.680664, 36.956593, 4.…
We can now visualise these summary statistics.
plot_tbl <- ltmodel_minmax_postsummary_tbl %>%
slice_sample(n = 25) %>%
arrange(customer_id)
ggplot(plot_tbl) +
geom_errorbar(
aes(x = customer_id, ymin = mu_p10, ymax = mu_p90),
width = 0, size = 1) +
geom_errorbar(
aes(x = customer_id, ymin = mu_p25, ymax = mu_p75),
width = 0, size = 4) +
geom_point(
aes(x = customer_id, y = customer_mu),
colour = "red", size = 2.5) +
labs(
x = "Customer ID",
y = "Posterior Mu",
title = "Summary Statistics of Mu Values"
) +
theme(axis.text.x = element_text(angle = 30, vjust = 0.5, size = 8))ggplot(plot_tbl) +
geom_errorbar(
aes(x = customer_id, ymin = tau_p10, ymax = tau_p90),
width = 0, size = 1) +
geom_errorbar(
aes(x = customer_id, ymin = tau_p25, ymax = tau_p75),
width = 0, size = 4) +
geom_point(
aes(x = customer_id, y = customer_tau),
colour = "red", size = 2.5) +
scale_y_log10(labels = label_comma()) +
labs(
x = "Customer ID",
y = "Posterior Tau",
title = "Summary Statistics of Tau Values"
) +
theme(axis.text.x = element_text(angle = 30, vjust = 0.5, size = 8))
2.3.3 Investigate Extreme q-Value
As we have done with our “Minimum Time” model, we look at the data where the \(q\) value is very low.
plot_tbl <- ltmodel_minmax_mu_qvalues_tbl %>%
filter(q_val < 0.05) %>%
select(customer_id) %>%
inner_join(ltmodel_minmax_postsummary_tbl, by = "customer_id")
ggplot(plot_tbl) +
geom_errorbar(
aes(x = customer_id, ymin = mu_p10, ymax = mu_p90),
size = 1, width = 0) +
geom_errorbar(
aes(x = customer_id, ymin = mu_p25, ymax = mu_p75),
size = 3, width = 0) +
geom_point(
aes(x = customer_id, y = customer_mu),
colour = "red") +
labs(
x = "Customer ID",
y = "Lifetime Mu",
title = "Posterior Estimates of Small q-Value Customers") +
theme(axis.text.x = element_text(angle = 30, vjust = 0.5, size = 8))
We are looking at customer_mu here, but we may instead need to look at
customer_tau.
plot_tbl <- ltmodel_minmax_mu_qvalues_tbl %>%
filter(q_val < 0.05) %>%
select(customer_id) %>%
inner_join(ltmodel_minmax_postsummary_tbl, by = "customer_id")
ggplot(plot_tbl) +
geom_errorbar(
aes(x = customer_id, ymin = tau_p10, ymax = tau_p90),
size = 1, width = 0) +
geom_errorbar(
aes(x = customer_id, ymin = tau_p25, ymax = tau_p75),
size = 3, width = 0) +
geom_point(
aes(x = customer_id, y = customer_tau),
colour = "red") +
labs(
x = "Customer ID",
y = "Lifetime Tau",
title = "Posterior Estimates of Small q-Value Customers") +
theme(axis.text.x = element_text(angle = 30, vjust = 0.5, size = 8))
2.3.4 Validate p_alive Metric
Once again, we construct the validation procedure for p_alive.
ltmodel_minmax_alive_tbl <- ltmodel_minmax_validation_tbl %>%
group_by(customer_id) %>%
summarise(
.groups = "drop",
mean_p_alive = mean(post_p_alive)
) %>%
mutate(
p_alive_decile = cut(
mean_p_alive,
breaks = seq(0, 1, by = 0.05),
labels = sprintf("PALIVE_%03d", 5 * 1:20)
)
) %>%
left_join(
test_tbl %>% count(customer_id, name = "tnx_count"),
by = "customer_id"
) %>%
replace_na(
list(tnx_count = 0)
) %>%
mutate(still_alive = if_else(tnx_count > 0, "ACTIVE", "INACTIVE"))
ltmodel_minmax_alive_tbl %>% glimpse()## Rows: 10,000
## Columns: 5
## $ customer_id <chr> "C201101_0002", "C201101_0004", "C201101_0006", "C20110…
## $ mean_p_alive <dbl> 0.009541509, 0.277784873, 0.027687502, 0.008255639, 0.0…
## $ p_alive_decile <fct> PALIVE_005, PALIVE_030, PALIVE_005, PALIVE_005, PALIVE_…
## $ tnx_count <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ still_alive <chr> "INACTIVE", "INACTIVE", "INACTIVE", "INACTIVE", "INACTI…
Having constructed the dataset, we now visualise the proportions.
ltmodel_minmax_alive_valid_tbl <- ltmodel_minmax_alive_tbl %>%
count(p_alive_decile, still_alive) %>%
pivot_wider(
id_cols = p_alive_decile,
names_from = still_alive,
values_from = n,
values_fill = 0
) %>%
mutate(
top_prop = str_replace(p_alive_decile, "PALIVE_(\\d+)", "\\1") %>%
as.numeric() %>%
divide_by(100),
active_prop = ACTIVE / (ACTIVE + INACTIVE)
)
ggplot(ltmodel_minmax_alive_valid_tbl) +
geom_point(aes(x = top_prop, y = active_prop)) +
geom_line(aes(x = top_prop, y = top_prop), colour = "red") +
labs(
x = "Bucket Proportion",
y = "Customer Proportion",
title = "Comparison Plot of p_alive Against Active Proportion"
) +
expand_limits(x = 0)
ltmodel_minmax_stanfit %>% write_rds("data/ltmodel_minmax_stanfit.rds")
2.4 Fit Tighter Fixed-Prior Min/Max Time Model
We now fit the same model but with a tighter prior around \(\mu\).
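To make "tighter" concrete: a \(\mathrm{Gamma}(s, \beta)\) prior has mean \(s / \beta\) and standard deviation \(\sqrt{s} / \beta\), so both choices keep the prior mean of \(\mu\) at 0.1 while the new prior shrinks the spread:

\[
\mathrm{Gamma}(1, 10):\ \mathrm{E}[\mu] = 0.1,\ \mathrm{SD}[\mu] = 0.1;
\qquad
\mathrm{Gamma}(5, 50):\ \mathrm{E}[\mu] = 0.1,\ \mathrm{SD}[\mu] = \tfrac{\sqrt{5}}{50} \approx 0.045.
\]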
stan_modelname <- "ltmodel_minmax"
stanfit_prefix <- str_c("fit_", stan_modelname)
stan_data_lst <- training_stats_tbl %>%
select(customer_id, obs_time = T_cal, min_lifetime, max_lifetime) %>%
compose_data(
s = 5,
beta = 50
)
ltmodel_tightprior_stanfit <- ltmodel_minmax_stanmodel$sample(
data = stan_data_lst,
chains = 4,
iter_warmup = 500,
iter_sampling = 500,
seed = 4204,
output_dir = stan_modeldir,
output_basename = stanfit_prefix
)## Running MCMC with 4 parallel chains...
##
## Chain 1 Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 2 Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 3 Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 4 Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 2 Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 1 Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 3 Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 4 Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 2 Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 1 Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 4 Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 3 Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 2 Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 1 Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 3 Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 4 Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 2 Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 1 Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 4 Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 3 Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 2 Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 2 Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 1 Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 1 Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 4 Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 4 Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 3 Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 3 Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 2 Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 1 Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 3 Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 4 Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 2 Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 1 Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 3 Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 4 Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 1 Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 2 Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 3 Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 4 Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 1 Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 3 Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 2 Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 4 Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 1 Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 1 finished in 76.6 seconds.
## Chain 2 Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 3 Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 2 finished in 77.5 seconds.
## Chain 4 Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 3 finished in 77.2 seconds.
## Chain 4 finished in 77.1 seconds.
##
## All 4 chains finished successfully.
## Mean chain execution time: 77.1 seconds.
## Total execution time: 78.8 seconds.
ltmodel_tightprior_stanfit$summary()## # A tibble: 50,001 × 10
## variable mean median sd mad q5 q95 rhat ess_bulk
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 lp__ -1.85e+5 -1.85e+5 72.5 69.7 -1.85e+5 -1.85e+5 1.00 724.
## 2 mu[1] 9.47e-2 8.88e-2 0.0412 0.0400 3.83e-2 1.70e-1 1.00 3611.
## 3 mu[2] 2.53e-2 2.36e-2 0.0112 0.0106 1.03e-2 4.70e-2 1.00 3683.
## 4 mu[3] 6.90e-2 6.36e-2 0.0322 0.0293 2.68e-2 1.30e-1 1.00 3185.
## 5 mu[4] 9.46e-2 8.80e-2 0.0410 0.0382 3.90e-2 1.72e-1 1.00 3052.
## 6 mu[5] 7.00e-2 6.42e-2 0.0320 0.0306 2.70e-2 1.28e-1 1.00 2796.
## 7 mu[6] 9.17e-2 8.56e-2 0.0422 0.0398 3.40e-2 1.70e-1 1.00 3720.
## 8 mu[7] 7.94e-2 7.43e-2 0.0347 0.0324 3.19e-2 1.44e-1 1.00 3477.
## 9 mu[8] 7.84e-2 7.30e-2 0.0345 0.0327 3.09e-2 1.43e-1 1.00 3563.
## 10 mu[9] 9.54e-2 9.05e-2 0.0405 0.0380 3.78e-2 1.68e-1 0.999 2580.
## # … with 49,991 more rows, and 1 more variable: ess_tail <dbl>
As before, we need to check the HMC diagnostics.
ltmodel_tightprior_stanfit$cmdstan_diagnose()## Processing csv files: /home/rstudio/workshop/stan_models/fit_ltmodel_minmax-1.csv, /home/rstudio/workshop/stan_models/fit_ltmodel_minmax-2.csv, /home/rstudio/workshop/stan_models/fit_ltmodel_minmax-3.csv, /home/rstudio/workshop/stan_models/fit_ltmodel_minmax-4.csv
##
## Checking sampler transitions treedepth.
## Treedepth satisfactory for all transitions.
##
## Checking sampler transitions for divergences.
## No divergent transitions found.
##
## Checking E-BFMI - sampler transitions HMC potential energy.
## E-BFMI satisfactory.
##
## Effective sample size satisfactory.
##
## Split R-hat values satisfactory all parameters.
##
## Processing complete, no problems detected.
2.4.1 Plot Customer Lifetime Diagnostic Plots
Once again we construct validation plots for our data, comparing the posterior distributions against the known customer values.
ltmodel_tightprior_validation_tbl <- ltmodel_tightprior_stanfit %>%
recover_types(training_stats_tbl) %>%
spread_draws(mu[customer_id], tau[customer_id], p_alive[customer_id]) %>%
ungroup() %>%
inner_join(customer_simparams_tbl, by = "customer_id") %>%
select(
customer_id, first_tnx_date, draw_id = .draw,
post_mu = mu, post_tau_mean = tau, post_p_alive = p_alive,
customer_mu, customer_tau
)
ltmodel_tightprior_validation_tbl %>% glimpse()## Rows: 20,000,000
## Columns: 8
## $ customer_id <chr> "C201101_0002", "C201101_0002", "C201101_0002", "C20110…
## $ first_tnx_date <date> 2011-01-01, 2011-01-01, 2011-01-01, 2011-01-01, 2011-0…
## $ draw_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, …
## $ post_mu <dbl> 0.1100270, 0.1016490, 0.0418283, 0.1075610, 0.1600910, …
## $ post_tau_mean <dbl> 9.08870, 9.83780, 23.90730, 9.29704, 6.24645, 9.09592, …
## $ post_p_alive <dbl> 0.00000e+00, 1.53991e-16, 2.65144e-07, 0.00000e+00, 0.0…
## $ customer_mu <dbl> 0.009023485, 0.009023485, 0.009023485, 0.009023485, 0.0…
## $ customer_tau <dbl> 17.25275, 17.25275, 17.25275, 17.25275, 17.25275, 17.25…
We now use our existing routine to compare the posterior values for mu with
the predetermined value.
ltmodel_tightprior_mu_qvalues_tbl <- ltmodel_tightprior_validation_tbl %>%
calculate_distribution_qvals(post_mu, customer_mu, customer_id, first_tnx_date)
ref_value <- ltmodel_tightprior_mu_qvalues_tbl %>%
nrow() %>%
divide_by(50)
ggplot(ltmodel_tightprior_mu_qvalues_tbl) +
geom_histogram(aes(x = q_val), bins = 50) +
geom_hline(aes(yintercept = ref_value), colour = "red") +
labs(
x = "Quantile",
y = "Count",
title = "Quantile Plot of the q-Values for the Posterior Distribution"
)ggplot(ltmodel_tightprior_mu_qvalues_tbl) +
geom_point(aes(x = q_val, y = customer_mu)) +
labs(
x = "Quantile",
y = "Customer Mu",
title = "Scatterplot of q-Value against Mu"
)
2.4.2 Calculate tightprior Model Posterior Summary Statistics
We now calculate some summary statistics of the posterior quantities.
ltmodel_tightprior_postsummary_tbl <- ltmodel_tightprior_validation_tbl %>%
group_by(customer_id) %>%
summarise(
.groups = "drop",
mu_p10 = quantile(post_mu, 0.10),
mu_p25 = quantile(post_mu, 0.25),
mu_median = median (post_mu),
mu_mean = mean (post_mu),
mu_p75 = quantile(post_mu, 0.75),
mu_p90 = quantile(post_mu, 0.90),
tau_p10 = quantile(post_tau_mean, 0.10),
tau_p25 = quantile(post_tau_mean, 0.25),
tau_median = median (post_tau_mean),
tau_mean = mean (post_tau_mean),
tau_p75 = quantile(post_tau_mean, 0.75),
tau_p90 = quantile(post_tau_mean, 0.90),
customer_mu = unique(customer_mu),
customer_tau = unique(customer_tau)
)
ltmodel_tightprior_postsummary_tbl %>% glimpse()## Rows: 10,000
## Columns: 15
## $ customer_id <chr> "C201101_0002", "C201101_0004", "C201101_0006", "C201101_…
## $ mu_p10 <dbl> 0.046494800, 0.012536390, 0.033793800, 0.048006760, 0.033…
## $ mu_p25 <dbl> 0.06424620, 0.01683717, 0.04572955, 0.06492627, 0.0465805…
## $ mu_median <dbl> 0.08876680, 0.02364685, 0.06358135, 0.08796385, 0.0641517…
## $ mu_mean <dbl> 0.09469945, 0.02527723, 0.06897114, 0.09458880, 0.0699766…
## $ mu_p75 <dbl> 0.11905750, 0.03158462, 0.08611853, 0.11793100, 0.0888050…
## $ mu_p90 <dbl> 0.14998180, 0.04081577, 0.11100260, 0.15017330, 0.1130137…
## $ tau_p10 <dbl> 6.667497, 24.500340, 9.008766, 6.658982, 8.848469, 6.7340…
## $ tau_p25 <dbl> 8.399290, 31.660975, 11.611900, 8.479540, 11.260650, 8.68…
## $ tau_median <dbl> 11.26545, 42.28900, 15.72790, 11.36830, 15.58805, 11.6876…
## $ tau_mean <dbl> 13.04351, 48.91947, 18.29236, 13.02305, 17.93547, 13.8739…
## $ tau_p75 <dbl> 15.56510, 59.39232, 21.86772, 15.40207, 21.46817, 16.3357…
## $ tau_p90 <dbl> 21.50777, 79.76759, 29.59126, 20.83043, 29.59817, 23.2381…
## $ customer_mu <dbl> 0.009023485, 0.016847819, 0.048667792, 0.141250317, 0.058…
## $ customer_tau <dbl> 17.252751, 155.215456, 28.310863, 9.680664, 36.956593, 4.…
We can now visualise these summary statistics.
plot_tbl <- ltmodel_tightprior_postsummary_tbl %>%
slice_sample(n = 25) %>%
arrange(customer_id)
ggplot(plot_tbl) +
geom_errorbar(
aes(x = customer_id, ymin = mu_p10, ymax = mu_p90),
width = 0, size = 1) +
geom_errorbar(
aes(x = customer_id, ymin = mu_p25, ymax = mu_p75),
width = 0, size = 4) +
geom_point(
aes(x = customer_id, y = customer_mu),
colour = "red", size = 2.5) +
labs(
x = "Customer ID",
y = "Posterior Mu",
title = "Summary Statistics of Mu Values"
) +
theme(axis.text.x = element_text(angle = 30, vjust = 0.5, size = 8))ggplot(plot_tbl) +
geom_errorbar(
aes(x = customer_id, ymin = tau_p10, ymax = tau_p90),
width = 0, size = 1) +
geom_errorbar(
aes(x = customer_id, ymin = tau_p25, ymax = tau_p75),
width = 0, size = 4) +
geom_point(
aes(x = customer_id, y = customer_tau),
colour = "red", size = 2.5) +
scale_y_log10(labels = label_comma()) +
labs(
x = "Customer ID",
y = "Posterior Tau",
title = "Summary Statistics of Tau Values"
) +
theme(axis.text.x = element_text(angle = 30, vjust = 0.5, size = 8))
2.4.3 Investigate Extreme q-Value
As we did with the "Min/Max" model, we look at the data where the \(q\) value is very low.
plot_tbl <- ltmodel_tightprior_mu_qvalues_tbl %>%
filter(q_val < 0.025) %>%
select(customer_id) %>%
inner_join(ltmodel_tightprior_postsummary_tbl, by = "customer_id")
ggplot(plot_tbl) +
geom_errorbar(
aes(x = customer_id, ymin = mu_p10, ymax = mu_p90),
size = 1, width = 0) +
geom_errorbar(
aes(x = customer_id, ymin = mu_p25, ymax = mu_p75),
size = 3, width = 0) +
geom_point(
aes(x = customer_id, y = customer_mu),
colour = "red") +
labs(
x = "Customer ID",
y = "Lifetime Mu",
title = "Posterior Estimates of Small q-Value Customers") +
theme(axis.text.x = element_text(angle = 30, vjust = 0.5, size = 8))
We are looking at customer_mu here, but we may instead need to look at
customer_tau.
plot_tbl <- ltmodel_tightprior_mu_qvalues_tbl %>%
filter(q_val < 0.025) %>%
select(customer_id) %>%
inner_join(ltmodel_tightprior_postsummary_tbl, by = "customer_id")
ggplot(plot_tbl) +
geom_errorbar(
aes(x = customer_id, ymin = tau_p10, ymax = tau_p90),
size = 1, width = 0) +
geom_errorbar(
aes(x = customer_id, ymin = tau_p25, ymax = tau_p75),
size = 3, width = 0) +
geom_point(
aes(x = customer_id, y = customer_tau),
colour = "red") +
labs(
x = "Customer ID",
y = "Lifetime Tau",
title = "Posterior Estimates of Small q-Value Customers") +
theme(axis.text.x = element_text(angle = 30, vjust = 0.5, size = 8))
2.4.4 Validate p_alive Metric
Once again, we construct the validation procedure for p_alive.
ltmodel_tightprior_alive_tbl <- ltmodel_tightprior_validation_tbl %>%
group_by(customer_id) %>%
summarise(
.groups = "drop",
mean_p_alive = mean(post_p_alive)
) %>%
mutate(
p_alive_decile = cut(
mean_p_alive,
breaks = seq(0, 1, by = 0.05),
labels = sprintf("PALIVE_%03d", 5 * 1:20)
)
) %>%
left_join(
test_tbl %>% count(customer_id, name = "tnx_count"),
by = "customer_id"
) %>%
replace_na(
list(tnx_count = 0)
) %>%
mutate(still_alive = if_else(tnx_count > 0, "ACTIVE", "INACTIVE"))
ltmodel_tightprior_alive_tbl %>% glimpse()## Rows: 10,000
## Columns: 5
## $ customer_id <chr> "C201101_0002", "C201101_0004", "C201101_0006", "C20110…
## $ mean_p_alive <dbl> 2.606529e-05, 2.431313e-02, 2.055829e-04, 2.505586e-05,…
## $ p_alive_decile <fct> PALIVE_005, PALIVE_005, PALIVE_005, PALIVE_005, PALIVE_…
## $ tnx_count <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ still_alive <chr> "INACTIVE", "INACTIVE", "INACTIVE", "INACTIVE", "INACTI…
Having constructed the dataset, we now visualise the proportions.
ltmodel_tightprior_alive_valid_tbl <- ltmodel_tightprior_alive_tbl %>%
count(p_alive_decile, still_alive) %>%
pivot_wider(
id_cols = p_alive_decile,
names_from = still_alive,
values_from = n,
values_fill = 0
) %>%
mutate(
top_prop = str_replace(p_alive_decile, "PALIVE_(\\d+)", "\\1") %>%
as.numeric() %>%
divide_by(100),
active_prop = ACTIVE / (ACTIVE + INACTIVE)
)
ggplot(ltmodel_tightprior_alive_valid_tbl) +
geom_point(aes(x = top_prop, y = active_prop)) +
geom_line(aes(x = top_prop, y = top_prop), colour = "red") +
labs(
x = "Bucket Proportion",
y = "Customer Proportion",
title = "Comparison Plot of p_alive Against Active Proportion"
) +
expand_limits(x = 0)
ltmodel_tightprior_stanfit %>% write_rds("data/ltmodel_tightprior_stanfit.rds")
3 Construct the Hierarchical Lifetime Model
We now look at adding uncertainty to our priors, as we did with the frequency model.
## data {
## int<lower=1> n; // count of observations
##
## vector<lower=0>[n] min_lifetime; // minimum observed lifetime for customer
## vector<lower=0>[n] max_lifetime; // maximum observed lifetime for customer
## vector<lower=0>[n] obs_time; // observation time since customer 'birth'
##
## real<lower=0> mean_p1, mean_p2;
## real<lower=0> cv_p1, cv_p2;
## }
##
## parameters {
## real<lower=0> hier_mean;
## real<lower=0> hier_cv;
##
## vector<lower=0>[n] mu;
## }
##
## transformed parameters {
## // shape <- 1 / (cv^2)
## // scale <- mu * cv^2
##
## real<lower=0> s = 1 / (hier_cv * hier_cv);
## real<lower=0> beta = 1 / (hier_mean * hier_cv * hier_cv);
## }
##
## model {
## hier_mean ~ gamma(mean_p1, mean_p2);
## hier_cv ~ gamma(cv_p1, cv_p2);
##
## mu ~ gamma(s, beta);
##
## target += exponential_lccdf(min_lifetime | mu);
## target += exponential_lcdf (max_lifetime | mu);
## }
##
## generated quantities {
## #include lifetime_genquan.stan
## }
We see this model is very similar to the frequency model, using Gamma priors.
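The transformed parameters block is the usual mean/coefficient-of-variation reparameterisation of the Gamma distribution: with \(\mathrm{E}[\mu] = s / \beta\) and \(\mathrm{CV}[\mu] = 1 / \sqrt{s}\), solving for the shape and rate gives

\[
s = \frac{1}{\mathrm{cv}^2},
\qquad
\beta = \frac{s}{\mathrm{mean}} = \frac{1}{\mathrm{mean} \cdot \mathrm{cv}^2},
\]

which matches the expressions for s and beta computed from hier_mean and hier_cv in the Stan code.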
3.1 Compile and Fit First Hierarchical Model
We first need to compile this fitted model.
ltmodel_hier_stanmodel <- cmdstan_model(
"stan_code/ltmodel_hier.stan",
include_paths = stan_codedir,
pedantic = TRUE,
dir = stan_modeldir
)stan_modelname <- "ltmodel_hier"
stanfit_prefix <- str_c("fit_", stan_modelname)
stan_data_lst <- training_stats_tbl %>%
select(customer_id, obs_time = T_cal, min_lifetime, max_lifetime) %>%
compose_data(
mean_p1 = 5,
mean_p2 = 50,
cv_p1 = 5,
cv_p2 = 5
)
ltmodel_hier_stanfit <- ltmodel_hier_stanmodel$sample(
data = stan_data_lst,
chains = 4,
iter_warmup = 500,
iter_sampling = 500,
seed = 4205,
output_dir = stan_modeldir,
output_basename = stanfit_prefix
)## Running MCMC with 4 parallel chains...
##
## Chain 1 Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 2 Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 3 Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 4 Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 1 Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 3 Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 2 Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 4 Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 3 Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 1 Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 2 Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 3 Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 4 Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 1 Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 2 Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 3 Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 2 Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 4 Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 1 Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 4 Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 2 Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 3 Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 2 Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 3 Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 1 Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 1 Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 4 Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 4 Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 1 Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 2 Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 3 Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 4 Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 1 Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 2 Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 3 Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 4 Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 1 Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 4 Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 3 Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 1 Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 2 Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 4 Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 3 Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 1 Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 1 finished in 165.7 seconds.
## Chain 2 Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 4 Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 4 finished in 169.6 seconds.
## Chain 3 Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 3 finished in 171.5 seconds.
## Chain 2 Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 2 finished in 174.3 seconds.
##
## All 4 chains finished successfully.
## Mean chain execution time: 170.3 seconds.
## Total execution time: 175.0 seconds.
ltmodel_hier_stanfit$summary()
## # A tibble: 50,005 × 10
## variable mean median sd mad q5 q95 rhat ess_bulk
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 lp__ 6291. 6330. 1.39e+3 1.40e+3 3.90e+3 8.65e+3 2.10 5.30
## 2 hier_mean 0.0185 0.0184 1.77e-4 1.74e-4 1.82e-2 1.88e-2 1.26 12.7
## 3 hier_cv 0.0680 0.0671 9.47e-3 9.31e-3 5.30e-2 8.55e-2 2.09 5.32
## 4 mu[1] 0.0185 0.0185 1.28e-3 1.22e-3 1.64e-2 2.06e-2 1.01 1757.
## 5 mu[2] 0.0182 0.0183 1.23e-3 1.24e-3 1.63e-2 2.03e-2 1.01 1662.
## 6 mu[3] 0.0184 0.0184 1.29e-3 1.24e-3 1.64e-2 2.07e-2 1.01 2018.
## 7 mu[4] 0.0185 0.0185 1.31e-3 1.30e-3 1.64e-2 2.07e-2 1.01 1908.
## 8 mu[5] 0.0184 0.0184 1.29e-3 1.25e-3 1.63e-2 2.05e-2 1.01 1382.
## 9 mu[6] 0.0185 0.0185 1.25e-3 1.21e-3 1.64e-2 2.06e-2 1.01 1419.
## 10 mu[7] 0.0185 0.0184 1.33e-3 1.25e-3 1.64e-2 2.08e-2 1.02 1710.
## # … with 49,995 more rows, and 1 more variable: ess_tail <dbl>
As before, we need to check the HMC diagnostics.
ltmodel_hier_stanfit$cmdstan_diagnose()
## Processing csv files: /home/rstudio/workshop/stan_models/fit_ltmodel_hier-1.csv, /home/rstudio/workshop/stan_models/fit_ltmodel_hier-2.csv, /home/rstudio/workshop/stan_models/fit_ltmodel_hier-3.csv, /home/rstudio/workshop/stan_models/fit_ltmodel_hier-4.csv
##
## Checking sampler transitions treedepth.
## Treedepth satisfactory for all transitions.
##
## Checking sampler transitions for divergences.
## No divergent transitions found.
##
## Checking E-BFMI - sampler transitions HMC potential energy.
## The E-BFMI, 0.01, is below the nominal threshold of 0.30 which suggests that HMC may have trouble exploring the target distribution.
## If possible, try to reparameterize the model.
##
## Effective sample size satisfactory.
##
## The following parameters had split R-hat greater than 1.05:
## hier_mean, hier_cv, s, beta
## Such high values indicate incomplete mixing and biased estimation.
## You should consider regularizating your model with additional prior information or a more effective parameterization.
##
## Processing complete.
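To build intuition for the split R-hat warnings above, here is a small self-contained sketch using the posterior package (already loaded in this session); the offset chain stands in for a chain stuck in its own region of the posterior:

```r
library(posterior)

# Columns are chains, rows are iterations. Well-mixed chains give an
# R-hat close to 1; a chain exploring a different region inflates it,
# which is the symptom flagged for hier_mean and hier_cv above.
set.seed(42)
mixed_chains <- matrix(rnorm(4000), nrow = 1000, ncol = 4)
stuck_chains <- mixed_chains
stuck_chains[, 4] <- stuck_chains[, 4] + 3  # offset one chain

rhat(mixed_chains)  # close to 1
rhat(stuck_chains)  # well above the 1.05 threshold
```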
It looks like we have too many degrees of freedom in this model: the data are not strong enough for the model to pin down the parameters, and so we have identifiability issues.
Note that we are not finding any divergent transitions in this model; instead, the posterior sample has not converged.
To illustrate this, we look at the traceplots for the hierarchical parameters.
mcmc_trace(
x = ltmodel_hier_stanfit$draws(),
pars = c("hier_mean", "hier_cv")
) +
  ggtitle("Traceplots of Lifetime Hierarchical Parameter Values")
ltmodel_hier_stanfit %>% write_rds("data/ltmodel_hier_stanfit.rds")
3.2 Compile and Fit One-Parameter Hierarchical Model
To reduce the degrees of freedom in this hierarchical model, we fix one of the hierarchical parameters and allow just one of the Gamma parameters to be fit from the data.
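If the fixed parameter is the Gamma shape, as the s = 1 entry in the data list below suggests, the lifetime distribution collapses to an Exponential and only the rate remains to be learned; a quick sanity check of that equivalence:

```r
# Assuming the fixed hierarchical parameter is the Gamma shape: with
# shape s = 1 the Gamma density equals the Exponential density, so only
# the rate parameter is left for the data to determine.
s    <- 1
beta <- 0.05
x    <- seq(0.5, 100, by = 0.5)
all.equal(dgamma(x, shape = s, rate = beta), dexp(x, rate = beta))  # TRUE
```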
ltmodel_hier_one_stanmodel <- cmdstan_model(
"stan_code/ltmodel_hier_one.stan",
include_paths = stan_codedir,
dir = stan_modeldir,
pedantic = TRUE
)
We now fit this model using our data and our priors.
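As a reminder of what tidybayes::compose_data produces, here is a toy call; the column names are illustrative, not the real training data:

```r
library(tidybayes)
library(tibble)

# compose_data() converts a data frame into the list format Stan expects:
# character/factor columns become integer indices (with an n_<name> count),
# an `n` element records the number of rows, and extra named arguments
# (such as the fixed shape s) are passed through unchanged.
toy_tbl <- tibble(customer_id = c("A", "B"), obs_time = c(10, 12))
compose_data(toy_tbl, s = 1)
```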
stan_modelname <- "ltmodel_hier_one"
stanfit_prefix <- str_c("fit_", stan_modelname)
stan_data_lst <- training_stats_tbl %>%
select(customer_id, obs_time = T_cal, min_lifetime, max_lifetime) %>%
compose_data(
mean_p1 = 2,
mean_p2 = 20,
s = 1
)
ltmodel_hier_one_stanfit <- ltmodel_hier_one_stanmodel$sample(
data = stan_data_lst,
chains = 4,
iter_warmup = 500,
iter_sampling = 500,
seed = 4205,
output_dir = stan_modeldir,
output_basename = stanfit_prefix
)
## Running MCMC with 4 parallel chains...
##
## Chain 1 Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 2 Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 3 Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 4 Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 2 Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 1 Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 4 Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 2 Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 1 Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 3 Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 2 Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 1 Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 4 Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 3 Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 2 Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 1 Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 4 Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 3 Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 2 Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 2 Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 1 Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 1 Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 4 Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 3 Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 4 Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 4 Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 3 Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 3 Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 2 Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 1 Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 4 Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 3 Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 1 Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 2 Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 4 Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 3 Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 1 Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 2 Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 4 Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 3 Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 2 Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 1 Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 4 Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 3 Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 2 Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 2 finished in 134.5 seconds.
## Chain 1 Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 1 finished in 135.5 seconds.
## Chain 4 Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 4 finished in 137.5 seconds.
## Chain 3 Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 3 finished in 139.5 seconds.
##
## All 4 chains finished successfully.
## Mean chain execution time: 136.8 seconds.
## Total execution time: 140.6 seconds.
ltmodel_hier_one_stanfit$summary()
## # A tibble: 50,003 × 10
## variable mean median sd mad q5 q95 rhat ess_bulk
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 lp__ -2.09e+4 -2.09e+4 7.31e+1 7.44e+1 -2.10e+4 -2.08e+4 1.00 631.
## 2 hier_mean 2.76e-2 2.75e-2 4.85e-4 4.86e-4 2.68e-2 2.84e-2 1.00 931.
## 3 mu[1] 3.22e-2 2.57e-2 2.61e-2 2.03e-2 4.25e-3 8.44e-2 1.00 2706.
## 4 mu[2] 7.72e-3 6.12e-3 5.77e-3 4.70e-3 1.28e-3 1.92e-2 1.00 2438.
## 5 mu[3] 2.20e-2 1.77e-2 1.67e-2 1.24e-2 3.63e-3 5.51e-2 1.00 2521.
## 6 mu[4] 3.17e-2 2.46e-2 2.56e-2 1.89e-2 4.93e-3 8.28e-2 1.00 2453.
## 7 mu[5] 2.26e-2 1.82e-2 1.74e-2 1.35e-2 3.70e-3 5.70e-2 1.00 2454.
## 8 mu[6] 3.15e-2 2.47e-2 2.53e-2 1.92e-2 5.08e-3 8.38e-2 1.00 2261.
## 9 mu[7] 2.71e-2 2.14e-2 2.06e-2 1.60e-2 4.40e-3 6.77e-2 1.00 3033.
## 10 mu[8] 2.56e-2 2.03e-2 2.08e-2 1.59e-2 4.10e-3 6.53e-2 0.999 2748.
## # … with 49,993 more rows, and 1 more variable: ess_tail <dbl>
As before, we need to check the HMC diagnostics.
ltmodel_hier_one_stanfit$cmdstan_diagnose()
## Processing csv files: /home/rstudio/workshop/stan_models/fit_ltmodel_hier_one-1.csv, /home/rstudio/workshop/stan_models/fit_ltmodel_hier_one-2.csv, /home/rstudio/workshop/stan_models/fit_ltmodel_hier_one-3.csv, /home/rstudio/workshop/stan_models/fit_ltmodel_hier_one-4.csv
##
## Checking sampler transitions treedepth.
## Treedepth satisfactory for all transitions.
##
## Checking sampler transitions for divergences.
## No divergent transitions found.
##
## Checking E-BFMI - sampler transitions HMC potential energy.
## E-BFMI satisfactory.
##
## Effective sample size satisfactory.
##
## Split R-hat values satisfactory all parameters.
##
## Processing complete, no problems detected.
We should also check some of the trace plots:
mcmc_trace(
x = ltmodel_hier_one_stanfit$draws(),
pars = c("hier_mean")
) +
  ggtitle("Traceplots of Lifetime Hierarchical Parameter Values")
3.2.1 Plot Customer Lifetime Diagnostic Plots
Once again we construct validation plots for our data, comparing the posterior distributions against the known customer parameters.
ltmodel_hier_one_validation_tbl <- ltmodel_hier_one_stanfit %>%
recover_types(training_stats_tbl) %>%
spread_draws(mu[customer_id], tau[customer_id], p_alive[customer_id]) %>%
ungroup() %>%
inner_join(customer_simparams_tbl, by = "customer_id") %>%
select(
customer_id, first_tnx_date, draw_id = .draw,
post_mu = mu, post_tau_mean = tau, post_p_alive = p_alive,
customer_mu, customer_tau
)
ltmodel_hier_one_validation_tbl %>% glimpse()
## Rows: 20,000,000
## Columns: 8
## $ customer_id <chr> "C201101_0002", "C201101_0002", "C201101_0002", "C20110…
## $ first_tnx_date <date> 2011-01-01, 2011-01-01, 2011-01-01, 2011-01-01, 2011-0…
## $ draw_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, …
## $ post_mu <dbl> 0.01524500, 0.09333740, 0.08243720, 0.03675360, 0.02700…
## $ post_tau_mean <dbl> 65.59540, 10.71380, 12.13040, 27.20820, 37.03190, 78.94…
## $ post_p_alive <dbl> 4.00943e-03, 2.09896e-15, 1.09292e-13, 1.66475e-06, 5.6…
## $ customer_mu <dbl> 0.009023485, 0.009023485, 0.009023485, 0.009023485, 0.0…
## $ customer_tau <dbl> 17.25275, 17.25275, 17.25275, 17.25275, 17.25275, 17.25…
We now use our existing routine to compare the posterior values for mu with
the predetermined value.
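calculate_distribution_qvals is a helper defined earlier in the workshop; the logic it implements can be sketched in a self-contained way on simulated numbers. For a well-calibrated posterior, the quantile at which the true value falls among the posterior draws is uniform on [0, 1], which is why the histogram below is compared against a flat reference line.

```r
# Sketch of the q-value idea on simulated data (not the helper's actual
# code): draw a "true" value from a prior, draw "posterior" samples from
# the same distribution (with no data the posterior equals the prior, so
# it is trivially calibrated), and record where the truth sits.
set.seed(42)
q_vals <- replicate(2000, {
  true_val <- rnorm(1,   mean = 0.02, sd = 0.005)
  draws    <- rnorm(500, mean = 0.02, sd = 0.005)
  mean(draws < true_val)
})
# A calibrated posterior gives a roughly flat histogram of q-values.
```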
ltmodel_hier_one_mu_qvalues_tbl <- ltmodel_hier_one_validation_tbl %>%
calculate_distribution_qvals(post_mu, customer_mu, customer_id, first_tnx_date)
ref_value <- ltmodel_hier_one_mu_qvalues_tbl %>%
nrow() %>%
divide_by(50)
ggplot(ltmodel_hier_one_mu_qvalues_tbl) +
geom_histogram(aes(x = q_val), bins = 50) +
geom_hline(aes(yintercept = ref_value), colour = "red") +
labs(
x = "Quantile",
y = "Count",
title = "Quantile Plot of the q-Values for the Posterior Distribution"
)
ggplot(ltmodel_hier_one_mu_qvalues_tbl) +
geom_point(aes(x = q_val, y = customer_mu)) +
labs(
x = "Quantile",
y = "Customer Mu",
title = "Scatterplot of q-Value against Mu"
)
3.2.2 Calculate hier_one Model Posterior Summary Statistics
We now calculate some summary statistics of these posterior quantities.
ltmodel_hier_one_postsummary_tbl <- ltmodel_hier_one_validation_tbl %>%
group_by(customer_id) %>%
summarise(
.groups = "drop",
mu_p10 = quantile(post_mu, 0.10),
mu_p25 = quantile(post_mu, 0.25),
mu_median = median (post_mu),
mu_mean = mean (post_mu),
mu_p75 = quantile(post_mu, 0.75),
mu_p90 = quantile(post_mu, 0.90),
tau_p10 = quantile(post_tau_mean, 0.10),
tau_p25 = quantile(post_tau_mean, 0.25),
tau_median = median (post_tau_mean),
tau_mean = mean (post_tau_mean),
tau_p75 = quantile(post_tau_mean, 0.75),
tau_p90 = quantile(post_tau_mean, 0.90),
customer_mu = unique(customer_mu),
customer_tau = unique(customer_tau)
)
ltmodel_hier_one_postsummary_tbl %>% glimpse()
## Rows: 10,000
## Columns: 15
## $ customer_id <chr> "C201101_0002", "C201101_0004", "C201101_0006", "C201101_…
## $ mu_p10 <dbl> 0.006584319, 0.001911039, 0.005625193, 0.007399732, 0.005…
## $ mu_p25 <dbl> 0.013751725, 0.003495250, 0.010576075, 0.013584125, 0.010…
## $ mu_median <dbl> 0.025724750, 0.006117480, 0.017723200, 0.024568500, 0.018…
## $ mu_mean <dbl> 0.032192187, 0.007715464, 0.021957003, 0.031732767, 0.022…
## $ mu_p75 <dbl> 0.04271982, 0.01048665, 0.02902128, 0.04240180, 0.0301956…
## $ mu_p90 <dbl> 0.06496142, 0.01554940, 0.04342043, 0.06614218, 0.0448212…
## $ tau_p10 <dbl> 15.39372, 64.31107, 23.03065, 15.11899, 22.31086, 15.6097…
## $ tau_p25 <dbl> 23.40835, 95.35902, 34.45745, 23.58390, 33.11742, 23.9691…
## $ tau_median <dbl> 38.87300, 163.46600, 56.42320, 40.70250, 54.97885, 40.469…
## $ tau_mean <dbl> 72.87064, 296.78178, 90.39908, 67.36516, 95.18223, 66.654…
## $ tau_p75 <dbl> 72.71855, 286.10250, 94.55350, 73.61538, 97.34675, 70.651…
## $ tau_p90 <dbl> 151.8764, 523.2762, 177.7716, 135.1405, 178.2489, 134.418…
## $ customer_mu <dbl> 0.009023485, 0.016847819, 0.048667792, 0.141250317, 0.058…
## $ customer_tau <dbl> 17.252751, 155.215456, 28.310863, 9.680664, 36.956593, 4.…
We can now visualise these summary statistics.
plot_tbl <- ltmodel_hier_one_postsummary_tbl %>%
slice_sample(n = 25) %>%
arrange(customer_id)
ggplot(plot_tbl) +
geom_errorbar(
aes(x = customer_id, ymin = mu_p10, ymax = mu_p90),
width = 0, size = 1) +
geom_errorbar(
aes(x = customer_id, ymin = mu_p25, ymax = mu_p75),
width = 0, size = 4) +
geom_point(
aes(x = customer_id, y = customer_mu),
colour = "red", size = 2.5) +
labs(
x = "Customer ID",
y = "Posterior Mu",
title = "Summary Statistics of Mu Values"
) +
theme(axis.text.x = element_text(angle = 30, vjust = 0.5, size = 8))
ggplot(plot_tbl) +
geom_errorbar(
aes(x = customer_id, ymin = tau_p10, ymax = tau_p90),
width = 0, size = 1) +
geom_errorbar(
aes(x = customer_id, ymin = tau_p25, ymax = tau_p75),
width = 0, size = 4) +
geom_point(
aes(x = customer_id, y = customer_tau),
colour = "red", size = 2.5) +
scale_y_log10(labels = label_comma()) +
labs(
x = "Customer ID",
y = "Posterior Tau",
title = "Summary Statistics of Tau Values"
) +
theme(axis.text.x = element_text(angle = 30, vjust = 0.5, size = 8))
3.2.3 Investigate Extreme q-Value
As we have done with our “Tight Prior” model, we look at the data where the q-value is very low.
plot_tbl <- ltmodel_hier_one_mu_qvalues_tbl %>%
filter(q_val < 0.05) %>%
select(customer_id) %>%
inner_join(ltmodel_hier_one_postsummary_tbl, by = "customer_id")
ggplot(plot_tbl) +
geom_errorbar(
aes(x = customer_id, ymin = mu_p10, ymax = mu_p90),
size = 1, width = 0) +
geom_errorbar(
aes(x = customer_id, ymin = mu_p25, ymax = mu_p75),
size = 3, width = 0) +
geom_point(
aes(x = customer_id, y = customer_mu),
colour = "red") +
labs(
x = "Customer ID",
y = "Lifetime Mu",
title = "Posterior Estimates of Small q-Value Customers") +
theme(axis.text.x = element_text(angle = 30, vjust = 0.5, size = 8))
We are looking at customer_mu here, but it may be more informative to look at customer_tau instead.
plot_tbl <- ltmodel_hier_one_mu_qvalues_tbl %>%
  filter(q_val < 0.05) %>%
  select(customer_id) %>%
  inner_join(ltmodel_hier_one_postsummary_tbl, by = "customer_id")
ggplot(plot_tbl) +
geom_errorbar(
aes(x = customer_id, ymin = tau_p10, ymax = tau_p90),
size = 1, width = 0) +
geom_errorbar(
aes(x = customer_id, ymin = tau_p25, ymax = tau_p75),
size = 3, width = 0) +
geom_point(
aes(x = customer_id, y = customer_tau),
colour = "red") +
labs(
x = "Customer ID",
y = "Lifetime Tau",
title = "Posterior Estimates of Small q-Value Customers") +
theme(axis.text.x = element_text(angle = 30, vjust = 0.5, size = 8))
3.2.4 Validate p_alive Metric
Once again, we construct the validation procedure for p_alive.
ltmodel_hier_one_alive_tbl <- ltmodel_hier_one_validation_tbl %>%
group_by(customer_id) %>%
summarise(
.groups = "drop",
mean_p_alive = mean(post_p_alive)
) %>%
mutate(
p_alive_decile = cut(
mean_p_alive,
breaks = seq(0, 1, by = 0.05),
labels = sprintf("PALIVE_%03d", 5 * 1:20)
)
) %>%
left_join(
test_tbl %>% count(customer_id, name = "tnx_count"),
by = "customer_id"
) %>%
replace_na(
list(tnx_count = 0)
) %>%
mutate(still_alive = if_else(tnx_count > 0, "ACTIVE", "INACTIVE"))
ltmodel_hier_one_alive_tbl %>% glimpse()
## Rows: 10,000
## Columns: 5
## $ customer_id <chr> "C201101_0002", "C201101_0004", "C201101_0006", "C20110…
## $ mean_p_alive <dbl> 0.03187255, 0.31023567, 0.04631211, 0.02653952, 0.04649…
## $ p_alive_decile <fct> PALIVE_005, PALIVE_035, PALIVE_005, PALIVE_005, PALIVE_…
## $ tnx_count <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ still_alive <chr> "INACTIVE", "INACTIVE", "INACTIVE", "INACTIVE", "INACTI…
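The comparison that follows rests on a standard calibration idea: among customers whose mean p_alive falls in, say, the 40% bucket, roughly 40% should still be active in the test period. A toy sketch with simulated probabilities (the names here are illustrative):

```r
library(tidyverse)

# When events truly occur with probability p_pred, the observed event
# rate within each probability bucket matches the bucket's midpoint.
set.seed(42)
toy_calib_tbl <- tibble(p_pred = runif(10000)) %>%
  mutate(
    alive  = rbinom(n(), size = 1, prob = p_pred),
    bucket = cut(p_pred, breaks = seq(0, 1, by = 0.1))
  ) %>%
  group_by(bucket) %>%
  summarise(obs_rate = mean(alive), .groups = "drop")
toy_calib_tbl
```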
Having constructed the dataset, we then construct the visualisation of the proportions.
ltmodel_hier_one_alive_valid_tbl <- ltmodel_hier_one_alive_tbl %>%
count(p_alive_decile, still_alive) %>%
pivot_wider(
id_cols = p_alive_decile,
names_from = still_alive,
values_from = n,
values_fill = 0
) %>%
mutate(
top_prop = str_replace(p_alive_decile, "PALIVE_(\\d+)", "\\1") %>%
as.numeric() %>%
divide_by(100),
active_prop = ACTIVE / (ACTIVE + INACTIVE)
)
ggplot(ltmodel_hier_one_alive_valid_tbl) +
geom_point(aes(x = top_prop, y = active_prop)) +
geom_line(aes(x = top_prop, y = top_prop), colour = "red") +
labs(
x = "Bucket Proportion",
y = "Customer Proportion",
title = "Comparison Plot of p_alive Against Active Proportion"
) +
expand_limits(x = 0)
ltmodel_hier_one_stanfit %>% write_rds("data/ltmodel_hier_one_stanfit.rds")
4 Construct Lifetime p_alive Comparison Plot
Finally, we want to gather the data behind each of our p_alive comparison plots and see how the different models compare.
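lsos is a helper from the workshop environment for listing objects by name; the same lookup can be sketched with plain ls() and get(), shown here on two toy stand-in tables:

```r
library(tidyverse)

# Toy stand-ins for the real *_alive_valid_tbl objects.
ltmodel_toy1_alive_valid_tbl <- tibble(top_prop = c(0.1, 0.2), active_prop = c(0.12, 0.18))
ltmodel_toy2_alive_valid_tbl <- tibble(top_prop = c(0.1, 0.2), active_prop = c(0.08, 0.22))

# Find the tables by naming convention, label each by its model name,
# and stack them into one long tibble for plotting.
combined_valid_tbl <- ls(pattern = "_alive_valid_tbl$") %>%
  enframe(name = NULL, value = "variable_name") %>%
  mutate(
    label = str_replace(variable_name, "ltmodel_(.*?)_alive_valid.*", "\\1"),
    data  = map(variable_name, get)
  ) %>%
  unnest(data)
```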
p_alive_valid_tbl <- lsos(pattern = "_alive_valid_tbl") %>%
row.names() %>%
enframe(name = NULL, value = "variable_name") %>%
mutate(
label = str_replace(variable_name, "ltmodel_(.*?)_alive_valid.*", "\\1"),
data = map(variable_name, get)) %>%
unnest(data)
ggplot(p_alive_valid_tbl) +
geom_line(aes(x = top_prop, y = active_prop, colour = label)) +
labs(
x = "Cu"
)
5 R Environment
options(width = 120L)
sessioninfo::session_info()
## ─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## setting value
## version R version 4.2.0 (2022-04-22)
## os Ubuntu 20.04.4 LTS
## system x86_64, linux-gnu
## ui X11
## language (EN)
## collate en_US.UTF-8
## ctype en_US.UTF-8
## tz Etc/UTC
## date 2022-07-18
## pandoc 2.17.1.1 @ /usr/local/bin/ (via rmarkdown)
##
## ─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## package * version date (UTC) lib source
## abind 1.4-5 2016-07-21 [1] RSPM (R 4.2.0)
## arrayhelpers 1.1-0 2020-02-04 [1] RSPM (R 4.2.0)
## assertthat 0.2.1 2019-03-21 [1] RSPM (R 4.2.0)
## backports 1.4.1 2021-12-13 [1] RSPM (R 4.2.0)
## base64enc 0.1-3 2015-07-28 [1] RSPM (R 4.2.0)
## bayesplot * 1.9.0 2022-03-10 [1] RSPM (R 4.2.0)
## bit 4.0.4 2020-08-04 [1] RSPM (R 4.2.0)
## bit64 4.0.5 2020-08-30 [1] RSPM (R 4.2.0)
## bookdown 0.27 2022-06-14 [1] RSPM (R 4.2.0)
## boot 1.3-28 2021-05-03 [2] CRAN (R 4.2.0)
## bridgesampling 1.1-2 2021-04-16 [1] RSPM (R 4.2.0)
## brms * 2.17.0 2022-07-04 [1] Github (paul-buerkner/brms@a43937c)
## Brobdingnag 1.2-7 2022-02-03 [1] RSPM (R 4.2.0)
## broom 0.8.0 2022-04-13 [1] RSPM (R 4.2.0)
## bslib 0.3.1 2021-10-06 [1] RSPM (R 4.2.0)
## cachem 1.0.6 2021-08-19 [1] RSPM (R 4.2.0)
## callr 3.7.0 2021-04-20 [1] RSPM (R 4.2.0)
## cellranger 1.1.0 2016-07-27 [1] RSPM (R 4.2.0)
## checkmate 2.1.0 2022-04-21 [1] RSPM (R 4.2.0)
## cli 3.3.0 2022-04-25 [1] RSPM (R 4.2.0)
## cmdstanr * 0.5.2 2022-07-04 [1] Github (stan-dev/cmdstanr@9bbc4eb)
## coda 0.19-4 2020-09-30 [1] RSPM (R 4.2.0)
## codetools 0.2-18 2020-11-04 [2] CRAN (R 4.2.0)
## colorspace 2.0-3 2022-02-21 [1] RSPM (R 4.2.0)
## colourpicker 1.1.1 2021-10-04 [1] RSPM (R 4.2.0)
## conflicted * 1.1.0 2021-11-26 [1] RSPM (R 4.2.0)
## cowplot * 1.1.1 2020-12-30 [1] RSPM (R 4.2.0)
## crayon 1.5.1 2022-03-26 [1] RSPM (R 4.2.0)
## crosstalk 1.2.0 2021-11-04 [1] RSPM (R 4.2.0)
## curl 4.3.2 2021-06-23 [1] RSPM (R 4.2.0)
## data.table 1.14.2 2021-09-27 [1] RSPM (R 4.2.0)
## DBI 1.1.3 2022-06-18 [1] RSPM (R 4.2.0)
## dbplyr 2.2.0 2022-06-05 [1] RSPM (R 4.2.0)
## digest 0.6.29 2021-12-01 [1] RSPM (R 4.2.0)
## directlabels * 2021.1.13 2021-01-16 [1] RSPM (R 4.2.0)
## distributional 0.3.0 2022-01-05 [1] RSPM (R 4.2.0)
## dplyr * 1.0.9 2022-04-28 [1] RSPM (R 4.2.0)
## DT 0.23 2022-05-10 [1] RSPM (R 4.2.0)
## dygraphs 1.1.1.6 2018-07-11 [1] RSPM (R 4.2.0)
## ellipsis 0.3.2 2021-04-29 [1] RSPM (R 4.2.0)
## evaluate 0.15 2022-02-18 [1] RSPM (R 4.2.0)
## fansi 1.0.3 2022-03-24 [1] RSPM (R 4.2.0)
## farver 2.1.0 2021-02-28 [1] RSPM (R 4.2.0)
## fastmap 1.1.0 2021-01-25 [1] RSPM (R 4.2.0)
## forcats * 0.5.1 2021-01-27 [1] RSPM (R 4.2.0)
## fs * 1.5.2 2021-12-08 [1] RSPM (R 4.2.0)
## furrr * 0.3.0 2022-05-04 [1] RSPM (R 4.2.0)
## future * 1.26.1 2022-05-27 [1] RSPM (R 4.2.0)
## gamm4 0.2-6 2020-04-03 [1] RSPM (R 4.2.0)
## generics 0.1.2 2022-01-31 [1] RSPM (R 4.2.0)
## ggdist 3.1.1 2022-02-27 [1] RSPM (R 4.2.0)
## ggplot2 * 3.3.6 2022-05-03 [1] RSPM (R 4.2.0)
## ggridges 0.5.3 2021-01-08 [1] RSPM (R 4.2.0)
## globals 0.15.0 2022-05-09 [1] RSPM (R 4.2.0)
## glue * 1.6.2 2022-02-24 [1] RSPM (R 4.2.0)
## gridExtra 2.3 2017-09-09 [1] RSPM (R 4.2.0)
## gtable 0.3.0 2019-03-25 [1] RSPM (R 4.2.0)
## gtools 3.9.2.2 2022-06-13 [1] RSPM (R 4.2.0)
## haven 2.5.0 2022-04-15 [1] RSPM (R 4.2.0)
## highr 0.9 2021-04-16 [1] RSPM (R 4.2.0)
## hms 1.1.1 2021-09-26 [1] RSPM (R 4.2.0)
## htmltools 0.5.2 2021-08-25 [1] RSPM (R 4.2.0)
## htmlwidgets 1.5.4 2021-09-08 [1] RSPM (R 4.2.0)
## httpuv 1.6.5 2022-01-05 [1] RSPM (R 4.2.0)
## httr 1.4.3 2022-05-04 [1] RSPM (R 4.2.0)
## igraph 1.3.2 2022-06-13 [1] RSPM (R 4.2.0)
## inline 0.3.19 2021-05-31 [1] RSPM (R 4.2.0)
## jquerylib 0.1.4 2021-04-26 [1] RSPM (R 4.2.0)
## jsonlite 1.8.0 2022-02-22 [1] RSPM (R 4.2.0)
## knitr 1.39 2022-04-26 [1] RSPM (R 4.2.0)
## labeling 0.4.2 2020-10-20 [1] RSPM (R 4.2.0)
## later 1.3.0 2021-08-18 [1] RSPM (R 4.2.0)
## lattice 0.20-45 2021-09-22 [2] CRAN (R 4.2.0)
## lifecycle 1.0.1 2021-09-24 [1] RSPM (R 4.2.0)
## listenv 0.8.0 2019-12-05 [1] RSPM (R 4.2.0)
## lme4 1.1-29 2022-04-07 [1] RSPM (R 4.2.0)
## loo 2.5.1 2022-03-24 [1] RSPM (R 4.2.0)
## lubridate 1.8.0 2021-10-07 [1] RSPM (R 4.2.0)
## magrittr * 2.0.3 2022-03-30 [1] RSPM (R 4.2.0)
## markdown 1.1 2019-08-07 [1] RSPM (R 4.2.0)
## MASS 7.3-56 2022-03-23 [2] CRAN (R 4.2.0)
## Matrix 1.4-1 2022-03-23 [2] CRAN (R 4.2.0)
## matrixStats 0.62.0 2022-04-19 [1] RSPM (R 4.2.0)
## memoise 2.0.1 2021-11-26 [1] RSPM (R 4.2.0)
## mgcv 1.8-40 2022-03-29 [2] CRAN (R 4.2.0)
## mime 0.12 2021-09-28 [1] RSPM (R 4.2.0)
## miniUI 0.1.1.1 2018-05-18 [1] RSPM (R 4.2.0)
## minqa 1.2.4 2014-10-09 [1] RSPM (R 4.2.0)
## modelr 0.1.8 2020-05-19 [1] RSPM (R 4.2.0)
## munsell 0.5.0 2018-06-12 [1] RSPM (R 4.2.0)
## mvtnorm 1.1-3 2021-10-08 [1] RSPM (R 4.2.0)
## nlme 3.1-157 2022-03-25 [2] CRAN (R 4.2.0)
## nloptr 2.0.3 2022-05-26 [1] RSPM (R 4.2.0)
## parallelly 1.32.0 2022-06-07 [1] RSPM (R 4.2.0)
## pillar 1.7.0 2022-02-01 [1] RSPM (R 4.2.0)
## pkgbuild 1.3.1 2021-12-20 [1] RSPM (R 4.2.0)
## pkgconfig 2.0.3 2019-09-22 [1] RSPM (R 4.2.0)
## plyr 1.8.7 2022-03-24 [1] RSPM (R 4.2.0)
## posterior * 1.2.2 2022-06-09 [1] RSPM (R 4.2.0)
## prettyunits 1.1.1 2020-01-24 [1] RSPM (R 4.2.0)
## processx 3.6.1 2022-06-17 [1] RSPM (R 4.2.0)
## projpred 2.1.2 2022-05-13 [1] RSPM (R 4.2.0)
## promises 1.2.0.1 2021-02-11 [1] RSPM (R 4.2.0)
## ps 1.7.1 2022-06-18 [1] RSPM (R 4.2.0)
## purrr * 0.3.4 2020-04-17 [1] RSPM (R 4.2.0)
## quadprog 1.5-8 2019-11-20 [1] RSPM (R 4.2.0)
## R6 2.5.1 2021-08-19 [1] RSPM (R 4.2.0)
## Rcpp * 1.0.8.3 2022-03-17 [1] RSPM (R 4.2.0)
## RcppParallel 5.1.5 2022-01-05 [1] RSPM (R 4.2.0)
## readr * 2.1.2 2022-01-30 [1] RSPM (R 4.2.0)
## readxl 1.4.0 2022-03-28 [1] RSPM (R 4.2.0)
## reprex 2.0.1 2021-08-05 [1] RSPM (R 4.2.0)
## reshape2 1.4.4 2020-04-09 [1] RSPM (R 4.2.0)
## rlang * 1.0.2 2022-03-04 [1] RSPM (R 4.2.0)
## rmarkdown 2.14 2022-04-25 [1] RSPM (R 4.2.0)
## rmdformats 1.0.4 2022-05-17 [1] RSPM (R 4.2.0)
## rstan 2.26.13 2022-07-04 [1] local
## rstantools 2.2.0 2022-04-08 [1] RSPM (R 4.2.0)
## rstudioapi 0.13 2020-11-12 [1] RSPM (R 4.2.0)
## rvest 1.0.2 2021-10-16 [1] RSPM (R 4.2.0)
## sass 0.4.1 2022-03-23 [1] RSPM (R 4.2.0)
## scales * 1.2.0 2022-04-13 [1] RSPM (R 4.2.0)
## sessioninfo 1.2.2 2021-12-06 [1] RSPM (R 4.2.0)
## shiny 1.7.1 2021-10-02 [1] RSPM (R 4.2.0)
## shinyjs 2.1.0 2021-12-23 [1] RSPM (R 4.2.0)
## shinystan 2.6.0 2022-03-03 [1] RSPM (R 4.2.0)
## shinythemes 1.2.0 2021-01-25 [1] RSPM (R 4.2.0)
## StanHeaders 2.26.13 2022-07-04 [1] local
## stringi 1.7.6 2021-11-29 [1] RSPM (R 4.2.0)
## stringr * 1.4.0 2019-02-10 [1] RSPM (R 4.2.0)
## svUnit 1.0.6 2021-04-19 [1] RSPM (R 4.2.0)
## tensorA 0.36.2 2020-11-19 [1] RSPM (R 4.2.0)
## threejs 0.3.3 2020-01-21 [1] RSPM (R 4.2.0)
## tibble * 3.1.7 2022-05-03 [1] RSPM (R 4.2.0)
## tidybayes * 3.0.2 2022-01-05 [1] RSPM (R 4.2.0)
## tidyr * 1.2.0 2022-02-01 [1] RSPM (R 4.2.0)
## tidyselect 1.1.2 2022-02-21 [1] RSPM (R 4.2.0)
## tidyverse * 1.3.1 2021-04-15 [1] RSPM (R 4.2.0)
## tzdb 0.3.0 2022-03-28 [1] RSPM (R 4.2.0)
## utf8 1.2.2 2021-07-24 [1] RSPM (R 4.2.0)
## V8 4.2.0 2022-05-14 [1] RSPM (R 4.2.0)
## vctrs * 0.4.1 2022-04-13 [1] RSPM (R 4.2.0)
## vroom 1.5.7 2021-11-30 [1] RSPM (R 4.2.0)
## withr 2.5.0 2022-03-03 [1] RSPM (R 4.2.0)
## xfun 0.31 2022-05-10 [1] RSPM (R 4.2.0)
## xml2 1.3.3 2021-11-30 [1] RSPM (R 4.2.0)
## xtable 1.8-4 2019-04-21 [1] RSPM (R 4.2.0)
## xts 0.12.1 2020-09-09 [1] RSPM (R 4.2.0)
## yaml 2.3.5 2022-02-21 [1] RSPM (R 4.2.0)
## zoo 1.8-10 2022-04-15 [1] RSPM (R 4.2.0)
##
## [1] /usr/local/lib/R/site-library
## [2] /usr/local/lib/R/library
##
## ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
options(width = 80L)